Preface



This book is the final outcome of VECPAR 2000 - 4th International Meeting on 
Vector and Parallel Processing. 

VECPAR constitutes a series of conferences, which have been organized by 
the Faculty of Engineering of the University of Porto since 1993, with the main 
objective of disseminating new knowledge on parallel computing. 



Readership of This Book 

The book is aimed at an audience of researchers and graduate students in a 
broad range of scientific areas, including not only computer science, but also 
applied mathematics and numerical analysis, physics, and engineering. 



Book Plan 

From a total of 66 papers selected on the basis of extended abstracts for
presentation at the conference, a subset of 34 papers was chosen during a second
review process for inclusion in the book, together with the invited talks. The
book contains a total of 40 papers organized into 6 chapters, each of which may
appeal to people in different but still related scientific areas. All chapters,
with the exception of Chapter 6, open with a short text providing a quick
overview of the organization and papers in the chapter.

The 13 papers in Chapter 1 cover aspects related to the use of multiple
processors. Operating systems, languages and software tools for scheduling, and
code transformation are the topics included in this chapter, which opens with
the talk on computing over the Internet, entitled Grid Computing, by Ian Foster.

Dietrich Stauffer’s invited talk, Cellular Automata: Applications, opens
Chapter 2, which deals mainly with problems of interest to computational physics.
Cellular automata, algorithms based on the Monte Carlo method, and simulation
of collision-free plasma and the radial Schrödinger equation are the topics
covered by the 6 selected papers in this chapter.

The majority of the papers in Chapter 3 are related to linear and non-linear 
algebra. The invited talk by Mark Stadtherr, Parallel Branch-and-Bound for
Chemical Engineering Applications: Load Balancing and Scheduling Issues,
addresses a common situation in the simulation of chemical process engineering,
very often requiring the solution of highly nonlinear problems with many
solutions. The 9 selected papers, though varying in their field of application,
cover, in general, variants of well-known algorithms, here taking advantage of
the matrix structure and computer architecture.

In Chapter 4, Michael Duff reviews in his invited talk the years of research
in image processing. Apart from the invited talk, Thirty Years of Parallel Image
Processing, this chapter comprises 3 more papers on image processing, covering
a medical application, an algorithm for particle tracking in experimental
high-energy particle physics, and the design of a parallel image processing
algorithm for orthogonal multiprocessor systems.

Chapter 5, centered on the finite/discrete element technique, includes the 
invited talk by David R. Owen, entitled Finite/Discrete Element Analysis of
Multi-fracture and Multi-contact Phenomena, and 4 more papers also related to 
the finite/discrete element technique. 

Chapter 6 is home to the invited talk by Ugo Piomelli, entitled Large Eddy
Simulation of Turbulent Flows, from Desktop to Supercomputers. Piomelli shows
us new results made possible by the joint use of parallel processing and
desktop computers. This brings the book to a close.



Student Papers 

The Student Paper Award, first included in the conference program in 1998, has 
given impetus to our objective of promoting the participation of students and 
providing them with the stimulus for a fulfilling research activity. It is our wish 
that the standards of quality achieved by the two recipients can be maintained 
for all their future years as researchers. 

— Student Paper Award (First Prize) 

• Parallel Image Processing System on a Cluster of Personal Computers 
by Jorge Barbosa, and also co-authored by J. Tavares and A.J. Padilha 
(Portugal) 

— Student Paper Award (Honourable Mention) 

• Suboptimal Communication Schedule for GEN_BLOCK Redistribution 
by Hyun-Gyoo Yook, and also co-authored by Myong-Soon Park (Korea) 

We benefited from the generous help of all scientific committee members and
additional reviewers in assembling the conference program and editing this
post-conference book, which, we hope, will be of value for some years to come.
Chapter 1:

Computational Grids, Languages, and Tools in 
Multiplatform Environments 



Introduction 



This chapter comprises the invited talk, entitled Computational Grids, by Ian
Foster, and 12 selected papers, covering the practical aspects of computing over
multi-platform environments, such as operating systems, languages, and software
tools for code transformation.

The first paper, by Forkert et al., describes a component-based framework,
called TENT, for the integration of tools in a computer-aided engineering envi- 
ronment, which supports the most common programming languages, operating 
systems, and hardware architectures; examples of applications are shown in the 
cases of airplane design and aircraft turbine design. 

Rischbeck and Watson present the design of a server, based on a scalable 
parallel virtual reality modeling language (VRML97) and with the power to 
support complex worlds with thousands of possibly interacting users. The
programming model and runtime system used for the implementation of the VRML
server enable, for instance, dynamic load balancing through active object
migration during runtime and multiplexing of active objects onto threads.

Walker has developed a parallel implementation of a simulator to emulate 
current web search engine indexers; genetic programming is the key feature be- 
hind this, providing the mechanisms for storage, processing, and retrieval of 
information for web crawlers. 

Scheduling is the subject of the next 4 papers in the chapter. Solsona et 
al. analyze an algorithm for explicit coscheduling of distributed tasks imple- 
mented on a Linux cluster; the proposed algorithm was evaluated on the basis of 
three applications from the NAS parallel benchmark. Fernandez et al. describe 
a parallel disk scheduling system architecture divided into three main queues 
(deterministic, statistic, and best-effort requests) to conclude that satisfying the 
deterministic request did not impair the scalability of the solution. 

Yook and Park propose a new scheduling algorithm for the redistribution of
one-dimensional arrays between different GEN_BLOCKs; results of experiments
on two platforms assess the effect of, for instance, the number of processors
and the array size, and show the superiority of the proposed algorithm.

Sobral and Proença discuss a scalable object-oriented parallel programming
system for load balancing at both compile- and run-time; the heuristic behind
the methodology and packing policies is discussed in detail and evaluated on
a multiplatform environment comprising one cluster of PCs running Linux and two
other architectures, based on 16 PowerPC 601 (66 MHz) nodes and 56 T805
transputers, both running Parix.

Gonzalez-Escribano et al. evaluate the consequences of series-parallel struc- 
ture programming models in shared-memory machines. They conclude that many 
parallel computations can be expressed using a structured parallel programming 
model with limited loss of parallelism, both at algorithm level and the machine 
execution level. 

Purnell, Corr, and Milligan touch upon the subject of automatic code paral- 
lelization; here, they present a neural network based tool, called KATT, which 
stands for knowledge assisted transformation tools. 

Sinnen and Sousa give an overview of the most common graph theoretic 
models used for mapping and scheduling; they discuss the design of a new paral- 
lelizing tool based on the so-called annotated hierarchical graph, which tries to 
integrate various standard graph models, namely the directed acyclic, iterative,
and temporal communication graphs.

Figueiredo et al. analyze the performance of distributed shared-memory mul- 
tiprocessors to conclude that a hardware-based block-multithreaded HDSM con- 
figuration outperforms a software-multithreaded counterpart on average by 13%; 
the conclusions were reached on the basis of a set of benchmarks from SPLASH-2
and C4.5.

Del Pino et al. study the inclusion of value prediction techniques in embedded 
processors. They conclude on the advantages of this technique, particularly in 
the case of communication and multimedia applications, which display a high 
degree of predictability. 




Computational Grids* 

Invited Talk 



Ian Foster¹ and Carl Kesselman²

¹ Mathematics and Computer Science Division
Argonne National Laboratory
Argonne, IL 60439
itf@mcs.anl.gov
² Information Sciences Institute
University of Southern California
Marina del Rey, CA 90292
carl@isi.edu



Abstract. In this introductory chapter, we lay the groundwork for the 
rest of the book by providing a more detailed picture of the expected 
purpose, shape, and architecture of future grid systems. We structure 
the chapter in terms of six questions that we believe are central to this 
discussion: Why do we need computational grids? What types of appli- 
cations will grids be used for? Who will use grids? How will grids be 
used? What is involved in building a grid? And, what problems must 
be solved to make grids commonplace? We provide an overview of each 
of these issues here, referring to subsequent chapters for more detailed 
discussion. 



1 Reasons for Computational Grids 

Why do we need computational grids? Computational approaches to problem 
solving have proven their worth in almost every field of human endeavor. Com- 
puters are used for modeling and simulating complex scientific and engineering 
problems, diagnosing medical conditions, controlling industrial equipment, fore- 
casting the weather, managing stock portfolios, and many other purposes. Yet, 
although there are certainly challenging problems that exceed our ability to solve 
them, computers are still used much less extensively than they could be. To pick 
just one example, university researchers make extensive use of computers when 
studying the impact of changes in land use on biodiversity, but city planners 
selecting routes for new roads or planning new zoning ordinances do not. Yet it 
is local decisions such as these that, ultimately, shape our future. 

* This work was supported by the Mathematical, Information, and Computational
Sciences Division subprogram of the Office of Advanced Scientific Computing
Research, U.S. Department of Energy, under Contract W-31-109-Eng-38.

Reprinted by permission of Morgan Kaufmann Publishers from The Grid: Blueprint
for a New Computing Infrastructure, I. Foster and C. Kesselman (Eds), 1998.



J.M.L.M. Palma et al. (Eds.): VECPAR2000, LNCS 1981, pp. 3-37, 2001. 
© Springer-Verlag Berlin Heidelberg 2001
There are a variety of reasons for this relative lack of use of computational
problem-solving methods, including lack of appropriate education and tools. But 
one important factor is that the average computing environment remains inad- 
equate for such computationally sophisticated purposes. While today’s PC is 
faster than the Cray supercomputer of 10 years ago, it is still far from adequate 
for predicting the outcome of complex actions or selecting from among many 
choices. That, after all, is why supercomputers have continued to evolve. 



1.1 Increasing Delivered Computation 

We believe that the opportunity exists to provide users — whether city plan- 
ners, engineers, or scientists — with substantially more computational power: an 
increase of three orders of magnitude within five years, and five orders of magni- 
tude within a decade. These dramatic increases will be achieved by innovations 
in a wide range of areas: 

1. Technology improvement: Evolutionary changes in VLSI technology and mi- 
croprocessor architecture can be expected to result in a factor of 10 increase 
in computational capabilities in the next five years, and a factor of 100 in- 
crease in the next ten. 

2. Increase in demand-driven access to computational power: Many applications 
have only episodic requirements for substantial computational resources. For 
example, a medical diagnosis system may be run only when a cardiogram 
is performed, a stock market simulation only when a user recomputes
retirement benefits, or a seismic simulation only after a major earthquake. If
mechanisms are in place to allow reliable, instantaneous, and transparent
access to high-end resources, then from the perspective of these applications
it is as if those resources are dedicated to them. Given the existence of
multi-teraFLOP systems, an increase in apparent computational power of three
or more orders of magnitude is feasible. 

3. Increased utilization of idle capacity: Most low-end computers (PCs and 
workstations) are often idle: various studies report utilizations of around
30% in academic and commercial environments [46], [21]. Utilization can 
be increased by a factor of two, even for parallel programs [4], without im- 
pinging significantly on productivity. The benefit to individual users can be 
substantially greater: factors of 100 or 1,000 increase in peak computational 
capacity have been reported [40], [74]. 

4. Greater sharing of computational results: The daily weather forecast involves 
perhaps 10^14 numerical operations. If we assume that the forecast is of
benefit to 10^7 people, we have 10^21 effective operations, comparable to the
computation performed each day on all the world’s PCs. Few other compu- 
tational results or facilities are shared so effectively today, but they may be 
in the future as other scientific communities adopt a “big science” approach 
to computation. The key to more sharing may be the development of collab- 
oratories: “. . . center[s] without walls, in which the nation’s researchers can 
perform their research without regard to geographical location — interacting 
with colleagues, accessing instrumentation, sharing data and computational
resources, and accessing information in digital libraries” [47]. 

5. New problem-solving techniques and tools: A variety of approaches can im- 
prove the efficiency or ease with which computation is applied to problem 
solving. For example, network-enabled solvers [17], [11] allow users to invoke 
advanced numerical solution methods without having to install sophisticated 
software. Teleimmersion techniques [49] facilitate the sharing of computa- 
tional results by supporting collaborative steering of simulations and explo- 
ration of data sets. 

Underlying each of these advances is the synergistic use of high-performance 
networking, computing, and advanced software to provide access to advanced 
computational capabilities, regardless of the location of users and resources. 
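
As a rough illustration of the arithmetic behind items 2 to 4, the following
Python sketch computes an apparent speedup from demand-driven access, a
utilization gain, and the effective operations obtained when one result is
shared by many people. The function names and every input figure are
assumptions chosen for illustration; they are not measurements from the
studies cited above.

```python
# Back-of-the-envelope arithmetic for the gains listed above.
# All names and input figures here are illustrative assumptions.


def apparent_speedup(shared_system_flops: float, desktop_flops: float) -> float:
    """Gain seen by an episodic job that borrows a shared high-end system."""
    return shared_system_flops / desktop_flops


def utilization_gain(current_utilization: float, target_utilization: float) -> float:
    """Factor by which delivered cycles grow when idle capacity is put to work."""
    return target_utilization / current_utilization


def effective_operations(ops_per_result: int, beneficiaries: int) -> int:
    """Operations 'delivered' when a single computed result serves many users."""
    return ops_per_result * beneficiaries


if __name__ == "__main__":
    # Assumed: a 1 teraFLOP shared system versus a 1 gigaFLOP desktop.
    print("apparent speedup:", apparent_speedup(1e12, 1e9))
    # Assumed: utilization raised from 30% to 60%.
    print("utilization gain:", utilization_gain(0.30, 0.60))
    # Assumed: one forecast of 10**14 operations shared by 10**7 people.
    print("effective operations:", effective_operations(10**14, 10**7))
```

The three factors are independent of one another, which is why the text
treats them as separate sources of increase rather than as a single estimate.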

1.2 Definition of Computational Grids 

The current status of computation is analogous in some respects to that of 
electricity around 1910. At that time, electric power generation was possible, 
and new devices were being devised that depended on electric power, but the 
need for each user to build and operate a new generator hindered use. The truly 
revolutionary development was not, in fact, electricity, but the electric power grid 
and the associated transmission and distribution technologies. Together, these 
developments provided reliable, low-cost access to a standardized service, with 
the result that power — which for most of human history has been accessible 
only in crude and not especially portable forms (human effort, horses, water 
power, steam engines, candles) — became universally accessible. By allowing both 
individuals and industries to take for granted the availability of cheap, reliable 
power, the electric power grid made possible both new devices and the new 
industries that manufactured them. 

By analogy, we adopt the term computational grid for the infrastructure that 
will enable the increases in computation discussed above. A computational grid 
is a hardware and software infrastructure that provides dependable, consistent, 
pervasive, and inexpensive access to high-end computational capabilities. 

We talk about an infrastructure because a computational grid is concerned, 
above all, with large-scale pooling of resources, whether compute cycles, data, 
sensors, or people. Such pooling requires significant hardware infrastructure to 
achieve the necessary interconnections and software infrastructure to monitor 
and control the resulting ensemble. In the rest of this chapter, and throughout 
the book, we discuss in detail the nature of this infrastructure. 

The need for dependable service is fundamental. Users require assurances that 
they will receive predictable, sustained, and often high levels of performance from 
the diverse components that constitute the grid; in the absence of such assur- 
ances, applications will not be written or used. The performance characteristics 
that are of interest will vary widely from application to application, but may 
include network bandwidth, latency, jitter, computer power, software services, 
security, and reliability. 
The need for consistency of service is a second fundamental concern. As with
electric power, we need standard services, accessible via standard interfaces, 
and operating within standard parameters. Without such standards, application 
development and pervasive use are impractical. A significant challenge when de- 
veloping standards is to encapsulate heterogeneity without compromising high- 
performance execution. 

Pervasive access allows us to count on services always being available, within 
whatever environment we expect to move. Pervasiveness does not imply that 
resources are everywhere or are universally accessible. We cannot access electric 
power in a new home until wire has been laid and an account established with the 
local utility; computational grids will have similarly circumscribed availability 
and controlled access. However, we will be able to count on universal access 
within the confines of whatever environment the grid is designed to support. 

Finally, an infrastructure must offer inexpensive (relative to income) access 
if it is to be broadly accepted and used. Homeowners and industrialists both 
make use of remote billion-dollar power plants on a daily basis because the cost 
to them is reasonable. A computational grid must achieve similarly attractive 
economics. 

It is the combination of dependability, consistency, and pervasiveness that 
will cause computational grids to have a transforming effect on how computa- 
tion is performed and used. By increasing the set of capabilities that can be 
taken for granted to the extent that they are noticed only by their absence, 
grids allow new tools to be developed and widely deployed. Much as pervasive 
access to bitmapped displays changed our baseline assumptions for the design of 
application interfaces, computational grids can fundamentally change the way 
we think about computation and resources. 

1.3 The Impact of Grids 

The history of network computing shows that orders-of-magnitude improvements 
in underlying technology invariably enable revolutionary, often unanticipated, 
applications of that technology, which in turn motivate further technological im- 
provements. As a result, our view of network computing has undergone repeated 
transformations over the past 40 years. 

There is considerable evidence that another such revolution is imminent. The 
capabilities of both computers and networks continue to increase dramatically. 
Ten years of research on metacomputing has created a solid base of experience in 
new applications that couple high-speed networking and computing. The time 
seems ripe for a transition from the heroic days of metacomputing to more 
integrated computational grids with dependable and pervasive computational 
capabilities and consistent interfaces. In such grids, today’s metacomputing ap- 
plications will be routine, and programmers will be able to explore a new gen- 
eration of yet more interesting applications that leverage teraFLOP computers 
and petabyte storage systems interconnected by gigabit networks. We present 
two simple examples to illustrate how grid functionality may transform different 
aspects of our lives. 
Today’s home finance software packages leverage the pervasive availability of
communication technologies such as modems, Internet service providers, and the 
Web to integrate up-to-date stock prices obtained from remote services into local 
portfolio value calculations. However, the actual computations performed on 
this data are relatively simple. In tomorrow’s grid environment, we can imagine 
individuals making stock-purchasing decisions on the basis of detailed Monte 
Carlo analyses of future asset value, performed on remote teraFLOP computers. 
The instantaneous use of three orders of magnitude more computing power than 
today will go unnoticed by prospective retirees, but their lives will be different 
because of more accurate information. 
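
A minimal sketch of the kind of Monte Carlo analysis mentioned above is given
here in Python. It assumes, purely for illustration, that the asset follows a
lognormal (geometric Brownian motion) model; the function name, its
parameters, and the model itself are assumptions and do not come from the
original text. Each trial is independent of the others, which is what makes
such an analysis easy to spread over many remote processors.

```python
import math
import random
import statistics


def simulate_final_values(initial, annual_return, annual_volatility,
                          years, trials, seed=0):
    """Sample possible asset values after `years` under an assumed lognormal model."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    drift = (annual_return - 0.5 * annual_volatility ** 2) * years
    spread = annual_volatility * math.sqrt(years)
    # Every trial is independent, so the trials could be split across many
    # machines and the partial results merged afterwards.
    return [initial * math.exp(drift + spread * rng.gauss(0.0, 1.0))
            for _ in range(trials)]


if __name__ == "__main__":
    # Hypothetical portfolio: 10,000 units, 5% return, 20% volatility, 20 years.
    values = simulate_final_values(10_000.0, 0.05, 0.20, years=20, trials=50_000)
    print("mean final value:  ", round(statistics.mean(values), 2))
    print("median final value:", round(statistics.median(values), 2))
    print("5th percentile:    ", round(statistics.quantiles(values, n=20)[0], 2))
```

Raising the number of trials narrows the statistical error of the estimates,
and it is this appetite for more trials that would consume the additional
computing power.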

Today, citizen groups evaluating a proposed new urban development must 
study uninspiring blueprints or perspective drawings at city hall. A computa- 
tional grid will allow them to call on powerful graphics computers and databases 
to transform the architect’s plans into realistic virtual reality depictions and to 
explore such design issues as energy consumption, lighting efficiency, or sound 
quality. Meeting online to walk through and discuss the impact of the new de- 
velopment on their community, they can arrive at better urban design and hence 
improved quality of life. Virtual reality-based simulation models of Los
Angeles, produced by William Jepson, and the walkthrough model of Soda Hall at
the University of California-Berkeley, constructed by Carlo Séquin and his
colleagues, are interesting exemplars of this use of computing [9].



1.4 Electric Power Grids 

We conclude this section by reviewing briefly some salient features of the com- 
putational grid’s namesake. The electric power grid is remarkable in terms of 
its construction and function, which together make it one of the technological 
marvels of the 20th century. Within large geographical regions (e.g., North
America), it forms essentially a single entity that provides power to billions of devices,
in a relatively efficient, low-cost, and reliable fashion. The North American grid 
alone links more than ten thousand generators with billions of outlets via a com- 
plex web of physical connections and trading mechanisms [12]. The components 
from which the grid is constructed are highly heterogeneous in terms of their 
physical characteristics and are owned and operated by different organizations. 
Consumers differ significantly in terms of the amount of power they consume, 
the service guarantees they require, and the amount they are prepared to pay. 

Analogies are dangerous things, and electricity is certainly very different from 
computation in many respects. Nevertheless, the following aspects of the power 
grid seem particularly relevant to the current discussion. 



Importance of Economics The role and structure of the power grid are driven 
to a large extent by economic factors. Oil- and coal-fired generators have signif- 
icant economies of scale. A power company must be able to call upon reserve 
capacity equal to its largest generator in case that generator fails; intercon- 
nections between regions allow for sharing of such reserve capacity, as well as 
enabling trading of excess power. The impact of economic factors on
computational grids is not well understood [34]. Where and when are there economies
of scale to be obtained in computational capabilities? Might economic factors 
lead us away from today’s model of a “computer on every desktop”? We note an 
intriguing development. Recent advances in power generation technology (e.g., 
small gas turbines) and the deregulation of the power industry are leading some 
analysts to look to the Internet for lessons regarding the future evolution of the 
electric power grid! 



Importance of Politics The developers of large-scale grids tell us that their 
success depended on regulatory, political, and institutional developments as 
much as on technical innovation [12]. This lesson should be taken to heart by 
developers of future computational grids. 



Complexity of Control The principal technical challenges in power grids — 
once technology issues relating to efficient generation and high-voltage trans- 
mission had been overcome — relate to the management of a complex ensemble 
in which changes at a single location can have far-reaching consequences [12]. 
Hence, we find that the power grid includes a sophisticated infrastructure for 
monitoring, management, and control. Again, there appear to be many paral- 
lels between this control problem and the problem of providing performance 
guarantees in large-scale, dynamic, and heterogeneous computational grid envi- 
ronments. 

2 Grid Applications 

What types of applications will grids be used for? Building on experiences in 
gigabit testbeds [41], [58], the I-WAY network [19], and other experimental
systems, we have identified five major application classes for computational
grids, listed in Table 1 and described briefly in this section. More details
about applications and their technical requirements are provided in the
referenced chapters.



Table 1. Five major classes of grid applications.

Category         Examples                  Characteristics
---------------  ------------------------  -------------------------------
Distributed      DIS                       Very large problems needing
supercomputing   Stellar dynamics          lots of CPU, memory, etc.
                 Ab initio chemistry

High             Chip design               Harness many otherwise idle
throughput       Parameter studies         resources to increase
                 Cryptographic problems    aggregate throughput

On demand        Medical instrumentation   Remote resources integrated
                 Network-enabled solvers   with local computation, often
                 Cloud detection           for bounded amount of time

Data             Sky survey                Synthesis of new information
intensive        Physics data              from many or large data sources
                 Data assimilation

Collaborative    Collaborative design      Support communication or
                 Data exploration          collaborative work between
                 Education                 multiple participants



2.1 Distributed Supercomputing

Distributed supercomputing applications use grids to aggregate substantial
computational resources in order to tackle problems that cannot be solved on a
single system. Depending on the grid on which we are working (see Section 3),
these aggregated resources might comprise the majority of the supercomputers in
the country or simply all of the workstations within a company. Here are some
contemporary examples:

— Distributed interactive simulation (DIS) is a technique used for training and
planning in the military. Realistic scenarios may involve hundreds of
thousands of entities, each with potentially complex behavior patterns. Yet even
the largest current supercomputers can handle at most 20,000 entities. In re- 
cent work, researchers at the California Institute of Technology have shown 
how multiple supercomputers can be coupled to achieve record-breaking lev- 
els of performance. 

— The accurate simulation of complex physical processes can require high spa- 
tial and temporal resolution in order to resolve fine-scale detail. Coupled 
supercomputers can be used in such situations to overcome resolution bar- 
riers and hence to obtain qualitatively new scientific results. Although high 
latencies can pose significant obstacles, coupled supercomputers have been 
used successfully in cosmology [53], high-resolution ab initio computational 
chemistry computations [51], and climate modeling [44]. 

Challenging issues from a grid architecture perspective include the need to 
coschedule what are often scarce and expensive resources, the scalability of pro- 
tocols and algorithms to tens or hundreds of thousands of nodes, latency-tolerant 
algorithms, and achieving and maintaining high levels of performance across het- 
erogeneous systems. 



2.2 High-Throughput Computing 

In high-throughput computing, the grid is used to schedule large numbers of 
loosely coupled or independent tasks, with the goal of putting unused processor 
cycles (often from idle workstations) to work. The result may be, as in distributed 
supercomputing, the focusing of available resources on a single problem, but the 
quasi-independent nature of the tasks involved leads to very different types of 
problems and problem-solving methods. Here are some examples: 
— Platform Computing Corporation reports that the microprocessor
manufacturer Advanced Micro Devices used high-throughput computing techniques
to exploit over a thousand computers during the peak design phases of their 
K6 and K7 microprocessors. These computers are located on the desktops 
of AMD engineers at a number of AMD sites and were used for design veri- 
fication only when not in use by engineers. 

— The Condor system from the University of Wisconsin is used to manage 
pools of hundreds of workstations at universities and laboratories around the 
world [40]. These resources have been used for studies as diverse as molecular
simulations of liquid crystals, studies of ground-penetrating radar, and the 
design of diesel engines. 

— More loosely organized efforts have harnessed tens of thousands of computers 
distributed worldwide to tackle hard cryptographic problems [39].
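
The core scheduling idea, handing each independent task to whichever machine
becomes idle first, can be sketched as a small simulation in Python. This is
a generic greedy list scheduler written for illustration; it is not the
algorithm used by Condor or by any other system named above, and the function
name and task durations are assumptions.

```python
import heapq


def schedule_independent_tasks(task_durations, n_workers):
    """Greedy list scheduling: each task goes to the worker that frees up first.

    Returns (makespan, assignment), where assignment maps a worker index to
    the list of task indices it ran, in order.
    """
    if n_workers < 1:
        raise ValueError("at least one worker is required")
    # Heap of (time at which the worker becomes idle, worker index).
    free_at = [(0.0, worker) for worker in range(n_workers)]
    heapq.heapify(free_at)
    assignment = {worker: [] for worker in range(n_workers)}
    for task_id, duration in enumerate(task_durations):
        idle_time, worker = heapq.heappop(free_at)
        assignment[worker].append(task_id)
        heapq.heappush(free_at, (idle_time + duration, worker))
    makespan = max(idle_time for idle_time, _ in free_at)
    return makespan, assignment


if __name__ == "__main__":
    durations = [3, 1, 2, 2, 4, 1]  # hypothetical task lengths, arbitrary units
    for n in (1, 2, 4):
        makespan, _ = schedule_independent_tasks(durations, n)
        print(f"{n} worker(s): all tasks finish after {makespan} time units")
```

Because the tasks never communicate, adding workers shortens the total
completion time until the longest single task becomes the limit.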

2.3 On-Demand Computing 

On-demand applications use grid capabilities to meet short-term requirements 
for resources that cannot be cost-effectively or conveniently located locally. These 
resources may be computation, software, data repositories, specialized sensors, 
and so on. In contrast to distributed supercomputing applications, these ap- 
plications are often driven by cost-performance concerns rather than absolute 
performance. For example: 

— The NEOS [17] and NetSolve [11] network-enhanced numerical solver sys- 
tems allow users to couple remote software and resources into desktop appli- 
cations, dispatching to remote servers calculations that are computationally 
demanding or that require specialized software. 

— A computer-enhanced MRI machine and scanning tunneling microscope 
(STM) developed at the National Center for Supercomputing Applications 
use supercomputers to achieve realtime image processing [56], [57]. The re- 
sult is a significant enhancement in the ability to understand what we are 
seeing and, in the case of the microscope, to steer the instrument. 

— A system developed at the Aerospace Corporation for processing of data from 
meteorological satellites uses dynamically acquired supercomputer resources 
to deliver the results of a cloud detection algorithm to remote meteorologists 
in quasi real time [37]. 

The challenging issues in on-demand applications derive primarily from the 
dynamic nature of resource requirements and the potentially large populations 
of users and resources. These issues include resource location, scheduling, code 
management, configuration, fault tolerance, security, and payment mechanisms. 

2.4 Data-Intensive Computing 

In data-intensive applications, the focus is on synthesizing new information 
from data that is maintained in geographically distributed repositories, digi- 
tal libraries, and databases. This synthesis process is often computationally and 
communication intensive as well. 

— Future high-energy physics experiments will generate terabytes of data per 
day, or around a petabyte per year. The complex queries used to detect 
“interesting” events may need to access large fractions of this data [42]. The 
scientific collaborators who will access this data are widely distributed, and 
hence the data systems in which data is placed are likely to be distributed 
as well. 

— The Digital Sky Survey will ultimately make many terabytes of astronomical 
photographic data available in numerous network-accessible databases. This 
facility enables new approaches to astronomical research based on distributed 
analysis, assuming that appropriate computational grid facilities exist. 

— Modern meteorological forecasting systems make extensive use of data as- 
similation to incorporate remote satellite observations. The complete process 
involves the movement and processing of many gigabytes of data. 
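
As a rough check on the data rates quoted for the high-energy physics example above, the following sketch converts a yearly volume into an average daily rate. The function name, the decimal unit convention (1 PB = 1000 TB), and the 365-day year are assumptions made for illustration only.

```python
# Rough consistency check: how many terabytes per day add up to a petabyte per year?
# Assumes decimal units (1 PB = 1000 TB) and a 365-day year.

TB_PER_PB = 1000
DAYS_PER_YEAR = 365


def daily_rate_tb(petabytes_per_year: float) -> float:
    """Return the average daily volume in terabytes for a yearly volume in petabytes."""
    return petabytes_per_year * TB_PER_PB / DAYS_PER_YEAR


if __name__ == "__main__":
    print(f"{daily_rate_tb(1.0):.2f} TB/day")  # about 2.74 TB/day
```

Under these assumptions, a petabyte per year corresponds to a few terabytes per day, which agrees with the figures quoted in the text.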

Challenging issues in data-intensive applications are the scheduling and con- 
figuration of complex, high-volume data flows through multiple levels of hierar- 
chy. 



2.5 Collaborative Computing 

Collaborative applications are concerned primarily with enabling and enhancing 
human-to-human interactions. Such applications are often structured in terms 
of a virtual shared space. Many collaborative applications are concerned with 
enabling the shared use of computational resources such as data archives and 
simulations; in this case, they also have characteristics of the other application 
classes just described. For example: 

— The BoilerMaker system developed at Argonne National Laboratory allows 
multiple users to collaborate on the design of emission control systems in 
industrial incinerators [20]. The different users interact with each other and 
with a simulation of the incinerator. 

— The CAVE5D system supports remote, collaborative exploration of large 
geophysical data sets and the models that generate them — for example, a 
coupled physical/biological model of the Chesapeake Bay [73]. 

— The NICE system developed at the University of Illinois at Chicago allows 
children to participate in the creation and maintenance of realistic virtual 
worlds, for entertainment and education [59]. 

Challenging aspects of collaborative applications from a grid architecture 
perspective are the realtime requirements imposed by human perceptual capa- 
bilities and the rich variety of interactions that can take place. 

We conclude this section with three general observations. First, we note that 
even in this brief survey we see a tremendous variety of already successful appli- 
cations. This rich set has been developed despite the significant difficulties faced 
by programmers developing grid applications in the absence of a mature grid 
infrastructure. As grids evolve, we expect the range and sophistication of 
applications to increase dramatically. Second, we observe that almost all of the 
applications demonstrate a tremendous appetite for computational resources (CPU, 
memory, disk, etc.) that cannot be met in a timely fashion by expected growth in 
single-system performance. This emphasizes the importance of grid technologies 
as a means of sharing computation as well as a data access and communication 
medium. Third, we see that many of the applications are interactive, or depend 
on tight synchronization with computational components, and hence depend 
on the availability of a grid infrastructure able to provide robust performance 
guarantees. 

3 Grid Communities 

Who will use grids? One approach to understanding computational grids is to 
consider the communities that they serve. Because grids are above all a mech- 
anism for sharing resources, we ask, What groups of people will have sufficient 
incentive to invest in the infrastructure required to enable sharing, and what 
resources will these communities want to share? 

One perspective on these questions holds that the benefits of sharing will 
almost always outweigh the costs and, hence, that we will see grids that link 
large communities with few common interests, within which resource sharing 
will extend to individual PCs and workstations. If we compare a computational 
grid to an electric power grid, then in this view, the grid is quasi-universal, and 
every user has the potential to act as a cogenerator. Skeptics respond that the 
technical and political costs of sharing resources will rarely outweigh the benefits, 
especially when coupling must cross institutional boundaries. Hence, they argue 
that resources will be shared only when there is considerable incentive to do so: 
because the resource is expensive, or scarce, or because sharing enables human 
interactions that are otherwise difficult to achieve. In this view, grids will be 
specialized, designed to support specific user communities with specific goals. 

Rather than take a particular position on how grids will evolve, we propose 
what we see as four plausible scenarios, each serving a different community. 
Future grids will probably include elements of all four. 

3.1 Government 

The first community that we consider comprises the relatively small number — 
thousands or perhaps tens of thousands — of officials, planners, and scientists 
concerned with problems traditionally assigned to national government, such as 
disaster response, national defense, and long-term research and planning. There 
can be significant advantage to applying the collective power of the nation’s 
fastest computers, data archives, and intellect to the solution of these problems. 
Hence, we envision a grid that uses the fastest networks to couple relatively 
small numbers of high-end resources across the nation — perhaps tens of ter- 
aFLOP computers, petabytes of storage, hundreds of sites, thousands of smaller 
systems — for two principal purposes: 

1. To provide a “strategic computing reserve,” allowing substantial computing 
resources to be applied to large problems in times of crisis, such as to plan 
responses to a major environmental disaster, earthquake, or terrorist attack 

2. To act as a “national collaboratory,” supporting collaborative investigations 
of complex scientific and engineering problems, such as global change, space 
station design, and environmental cleanup 

An important secondary benefit of this high-end national supercomputing 
grid is to support resource trading between the various operators of high-end 
resources, hence increasing the efficiency with which those resources are used. 

This national grid is distinguished by its need to integrate diverse high-end 
(and hence complex) resources, the strategic importance of its overall mission, 
and the diversity of competing interests that must be balanced when allocating 
resources. 

3.2 A Health Maintenance Organization 

In our second example, the community supported by the grid comprises admin- 
istrators and medical personnel located at a small number of hospitals within a 
metropolitan area. The resources to be shared are a small number of high-end 
computers, hundreds of workstations, administrative databases, medical image 
archives, and specialized instruments such as MRI machines, CAT scanners, and 
cardioangiography devices. The coupling of these resources into an integrated 
grid enables a wide range of new, computationally enhanced applications: desk- 
top tools that use centralized supercomputer resources to run computer-aided 
diagnosis procedures on mammograms or to search centralized medical image 
archives for similar cases; life-critical applications such as telerobotic surgery and 
remote cardiac monitoring and analysis; auditing software that uses the many 
workstations across the hospital to run fraud detection algorithms on financial 
records; and research software that uses supercomputers and idle workstations 
for epidemiological research. Each of these applications exists today in research 
laboratories but has rarely been deployed in ordinary hospitals because of the 
high cost of computation. 

This private grid is distinguished by its relatively small scale, central man- 
agement, and common purpose on the one hand, and on the other hand by the 
complexity inherent in using common infrastructure for both life-critical applica- 
tions and less reliability-sensitive purposes and by the need to integrate low-cost 
commodity technologies. We can expect grids with similar characteristics to be 
useful in many institutions. 

3.3 A Materials Science Collaboratory 

The community in our third example is a group of scientists who operate and 
use a variety of instruments, such as electron microscopes, particle accelera- 
tors, and X-ray sources, for the characterization of materials. This community is 
fluid and highly distributed, comprising many hundreds of university researchers 
and students from around the world, in addition to the operators of the various 
instruments (tens of instruments, at perhaps ten centers). The resources 
that are being shared include the instruments themselves, data archives contain- 
ing the collective knowledge of this community, sophisticated analysis software 
developed by different groups, and various supercomputers used for analysis. 
Applications enabled by this grid include remote operation of instruments, col- 
laborative analysis, and supercomputer-based online analysis. 

This virtual grid is characterized by a strong unifying focus and relatively 
narrow goals on the one hand, and on the other hand by dynamic membership, a 
lack of central control, and a frequent need to coexist with other uses of the same 
resources. We can imagine similar grids arising to meet the needs of a variety of 
multi-institutional research groups and multicompany virtual teams created to 
pursue long- or short-term goals. 



3.4 Computational Market Economy 

The fourth community that we consider comprises the participants in a broad- 
based market economy for computational services. This is a potentially enormous 
community with no connections beyond the usual market-oriented relationships. 
We can expect participants to include consumers, with their diverse needs and 
interests; providers of specialized services, such as financial modeling, graph- 
ics rendering, and interactive gaming; providers of compute resources; network 
providers, who contract to provide certain levels of network service; and various 
other entities such as banks and licensing organizations. 

This public grid is in some respects the most intriguing of the four scenarios 
considered here, but is also the least concrete. One area of uncertainty con- 
cerns the extent to which the average consumer will also act as a producer 
of computational resources. The answer to this question seems to depend on 
two issues. Will applications emerge that can exploit loosely coupled compu- 
tational resources? And, will owners of resources be motivated to contribute 
resources? To date, large-scale activity in this area has been limited to fairly 
esoteric computations — such as searching for prime numbers, breaking crypto- 
graphic codes [39], or detecting extraterrestrial communications [63] — with the 
benefit to the individuals being the fun of participating and the potential mo- 
mentary fame if their computer solves the problem in question. 

We conclude this section by noting that, in our view, each of these scenar- 
ios seems quite feasible; indeed, substantial prototypes have been created for 
each of the grids that we describe. Hence, we expect to see not just one single 
computational grid, but rather many grids, each serving a different community 
with its own requirements and objectives. Just which grids will evolve depends 
critically on three issues: the evolving economics of computing and networking, 
and the services that these physical infrastructure elements are used to provide; 
the institutional, regulatory, and political frameworks within which grids may 
develop; and, above all, the emergence of applications able to motivate users to 
invest in and use grid technologies. 

4 Using Grids 

How will grids be used? In metacomputing experiments conducted to date, users 
have been “heroic” programmers, willing to spend large amounts of time pro- 
gramming complex systems at a low level. The resulting applications have pro- 
vided compelling demonstrations of what might be, but in most cases are too 
expensive, unreliable, insecure, and fragile to be considered suitable for general 
use. 

For grids to become truly useful, we need to take a significant step for- 
ward in grid programming, moving from the equivalent of assembly language 
to high-level languages, from one-off libraries to application toolkits, and from 
hand-crafted codes to shrink-wrapped applications. These goals are familiar to 
us from conventional programming, but in a grid environment we are faced with 
the additional difficulties associated with wide area operation — in particular, 
the need for grid applications to adapt to changes in resource properties in order 
to meet performance requirements. As in conventional computing, an impor- 
tant step toward the realization of these goals is the development of standards 
for applications, programming models, tools, and services, so that a division of 
labor can be achieved between the users and developers of different types of 
components. 

We structure our discussion of grid tools and programming in terms of the 
classification illustrated in Table 2. At the lowest level, we have grid developers — 
the designers and implementors of what we might call the “Grid Protocol,” by 
analogy with the Internet Protocol that provides the lowest-level services in the 
Internet — who provide the basic services required to construct a grid. Above this, 
we have tool developers, who use grid services to construct programming models 
and associated tools, layering higher-level services and abstractions on top of 
the more fundamental services provided by the grid architecture. Application 
developers, in turn, build on these programming models, tools, and services to 
construct grid-enabled applications for end users who, ideally, can use these 
applications without being concerned with the fact that they are operating in a 
grid environment. A fifth class of users, system administrators, is responsible for 
managing grid components. We now examine this model in more detail. 

4.1 Grid Developers 

A very small group of grid developers are responsible for implementing the basic 
services referred to above. We discuss the concerns encountered at this level in 
Section 5. 

4.2 Tool Developers 

Table 2. Classes of grid users. 

Class                   Purpose                            Makes use of               Concerns
----------------------  ---------------------------------  -------------------------  --------------------------------------------
End users               Solve problems                     Applications               Transparency, performance
Application developers  Develop applications               Programming models, tools  Ease of use, performance
Tool developers         Develop tools, programming models  Grid services              Adaptivity, exposure of performance, security
Grid developers         Provide basic grid services        Local system services      Local simplicity, connectivity, security
System administrators   Manage grid resources              Management tools           Balancing local and global concerns

Our second group of users are the developers of the tools, compilers, libraries, and 
so on that implement the programming models and services used by application 
developers. Today’s small population of grid tool developers (e.g., the developers 
of Condor [40], Nimrod [1], NEOS [17], NetSolve [11], Horus [67], grid-enabled 
implementations of the Message Passing Interface (MPI) [27], and CAVERN [38]) 
must build their tools on a very narrow foundation, comprising little more than 
the Internet Protocol. We envision that future grid systems will provide a richer 
set of basic services, hence making it possible to build more sophisticated and 
robust tools. We discuss the nature and implementation of those basic services 
in Section 5; briefly, they comprise versions of those services that have proven 
effective on today’s end systems and clusters, such as authentication, process 
management, data access, and communication, plus new services that address 
specific concerns of the grid environment, such as resource location, information, 
fault detection, security, and electronic payment. 

Tool developers must use these basic services to provide efficient implemen- 
tations of the programming models that will be used by application developers. 
In constructing these translations, the tool developer must be concerned not 
only with translating the existing model to the grid environment, but also with 
revealing to the programmer those aspects of the grid environment that impact 
performance. For example, a grid-enabled MPI [27] can seek to adapt the MPI 
model for grid execution by incorporating specialized techniques for point-to- 
point and collective communication in highly heterogeneous environments; im- 
plementations of collective operations might use multicast protocols and adapt a 
combining tree structure in response to changing network loads. It should prob- 
ably also extend the MPI model to provide programmers with access to resource 
location services, information about grid topology, and group communication 
protocols. 
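
To make the idea of a combining tree concrete, the following minimal sketch simulates a collective reduction in which one value per process is combined pairwise, round by round. It is an illustration under stated assumptions, not a description of any particular MPI implementation: the function name `tree_reduce` is hypothetical, processes are modeled as list entries, and the adaptation to network load mentioned above is not modeled.

```python
from typing import Callable, List, Tuple


def tree_reduce(values: List[int], combine: Callable[[int, int], int]) -> Tuple[int, int]:
    """Combine one value per process pairwise, round by round.

    Returns the combined result and the number of communication rounds used.
    """
    if not values:
        raise ValueError("need at least one value")
    level = list(values)
    rounds = 0
    while len(level) > 1:
        next_level = []
        # Pair neighbours; an unpaired last element is carried up unchanged.
        for i in range(0, len(level) - 1, 2):
            next_level.append(combine(level[i], level[i + 1]))
        if len(level) % 2 == 1:
            next_level.append(level[-1])
        level = next_level
        rounds += 1
    return level[0], rounds


if __name__ == "__main__":
    total, rounds = tree_reduce(list(range(1, 9)), lambda a, b: a + b)
    print(total, rounds)  # 36 3
```

With eight values the reduction completes in three rounds rather than seven sequential steps. A grid-aware implementation could, in principle, choose which partners to pair in each round according to measured network conditions.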



4.3 Application Developers 

Our third class of users comprises those who construct grid-enabled applications 
and components. Today, these programmers write applications in what is, in 
effect, an assembly language: explicit calls to the Internet Protocol’s User Data- 
gram Protocol (UDP) or Transmission Control Protocol (TCP), explicit or no 
management of failure, hard-coded configuration decisions for specific computing 
systems, and so on. We are far removed from the portable, efficient, high-level 
languages that are used to develop sequential programs, and the advanced 
services that programmers can rely upon when using these languages, such as 
dynamic memory management and high-level I/O libraries. 

Future grids will need to address the needs of application developers in two 
ways. They must provide programming models (supported by languages, libraries, 
and tools) that are appropriate for grid environments and a range of services 
(for security, fault detection, resource management, data access, communication, 
etc.) that programmers can call upon when developing applications. 

The purpose of both programming models and services is to simplify thinking 
about and implementing complex algorithmic structures, by providing a set of 
abstractions that hide details unrelated to the application, while exposing design 
decisions that have a significant impact on program performance or correctness. 
In sequential programming, commonly used programming models provide us 
with abstractions such as subroutines and scoping; in parallel programming, we 
have threads and condition variables (in shared-memory parallelism), message 
passing, distributed arrays, and single-assignment variables. Associated services 
ensure that resources are allocated to processes in a reasonable fashion, provide 
convenient abstractions for tertiary storage, and so forth. 

There is no consensus on what programming model is appropriate for a grid 
environment, although it seems clear that many models will be used. Table 3 
summarizes some of the models that have been proposed; new models will emerge 
as our understanding of grid programming evolves. 

As Table 3 makes clear, one approach to grid programming is to adapt mod- 
els that have already proved successful in sequential or parallel environments. 
For example, a grid-enabled distributed shared-memory (DSM) system would 
support a shared-memory programming model in a grid environment, allowing 
programmers to specify parallelism in terms of threads and shared-memory 
operations. Similarly, a grid-enabled MPI would extend the popular message-passing 
model [27], and a grid-enabled file system would permit remote files to be 
accessed via the standard UNIX application programming interface (API) [65]. 
These approaches have the advantage of potentially allowing existing applications 
to be reused unchanged, but can introduce significant performance problems 
if the models in question do not adapt well to high-latency, dynamic, 
heterogeneous grid environments. 

Another approach is to build on technologies that have proven effective in 
distributed computing, such as Remote Procedure Call (RPC) or related object- 
based techniques such as the Common Object Request Broker Architecture 
(CORBA). These technologies have significant software engineering advantages, 
because their encapsulation properties facilitate the modular construction of 
programs and the reuse of existing components. However, it remains to be seen 
whether these models can support performance-focused, complex applications 
such as teleimmersion or the construction of dynamic computations that span 
hundreds or thousands of processors. 







Table 3. Potential grid programming models and their advantages and 
disadvantages. 

Model                          Examples               Pros                             Cons
-----------------------------  ---------------------  -------------------------------  ------------------------
Datagram/stream communication  UDP, TCP, Multicast    Low overhead                     Low level
Shared memory, multithreading  POSIX Threads, DSM     High level                       Scalability
Data parallelism               HPF, HPC++             Automatic parallelization        Restricted applicability
Message passing                MPI, PVM               High performance                 Low level
Object-oriented                CORBA, DCOM, Java RMI  Support for large-system design  Performance
Remote procedure call          DCE, ONC               Simplicity                       Restricted applicability
High throughput                Condor, LSF, Nimrod    Ease of use                      Restricted applicability
Group ordered                  Isis, Totem            Robustness                       Performance, scalability
Agents                         Aglets, Telescript     Flexibility                      Performance, robustness

The grid environment can also motivate new programming models and ser- 
vices. For example, high-throughput computing systems, as exemplified by Con- 
dor [40] and Nimrod [1], support problem-solving methods such as parameter 
studies in which complex problems are partitioned into many independent tasks. 
Group-ordered communication systems represent another model that is impor- 
tant in dynamic, unpredictable grid environments; they provide services for man- 
aging groups of processes and for delivering messages reliably to group members. 
Agent-based programming models represent another approach apparently well 
suited to grid environments; here, programs are constructed as independent en- 
tities that roam the network searching for data or performing other tasks on 
behalf of a user. 
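
The parameter-study style of high-throughput computing described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the interface of Condor or Nimrod: `simulate` is a hypothetical stand-in for one simulation run, and a local thread pool stands in for a pool of remote workstations.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product
from typing import Dict, List, Tuple


def simulate(params: Tuple[float, int]) -> float:
    """Hypothetical stand-in for one independent simulation run."""
    rate, steps = params
    value = 1.0
    for _ in range(steps):
        value *= 1.0 + rate
    return value


def parameter_study(
    rates: List[float], steps: List[int], workers: int = 4
) -> Dict[Tuple[float, int], float]:
    """Run every (rate, steps) combination as an independent task."""
    grid = list(product(rates, steps))
    # Threads stand in for remote workstations; no task depends on another.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(simulate, grid))
    return dict(zip(grid, results))


if __name__ == "__main__":
    study = parameter_study([0.0, 0.5, 1.0], [1, 2])
    print(study[(1.0, 2)])  # 4.0
```

Because the tasks share no state, they can be dispatched in any order and rerun individually after a failure, which is what makes this model a good fit for loosely coupled resources.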

A wide range of new services can be expected to arise in grid environments 
to support the development of more complex grid applications. In addition to 
grid analogs of conventional services such as file systems, we will see new ser- 
vices for resource discovery, resource brokering, electronic payments, licensing, 
fault tolerance, specification of use conditions, configuration, adaptation, and 
distributed system management, to name just a few. 

4.4 End Users 

Most grid users, like most users of computers or networks today, will not write 
programs. Instead, they will use grid-enabled applications that make use of grid 
resources and services. These applications may be chemistry packages or 
environmental models that use grid resources for computing or data; problem-solving 
packages that help set up parameter study experiments [1]; mathematical packages 
augmented with calls to network-enabled solvers [17], [11]; or collaborative 
engineering packages that allow geographically separated users to cooperate on 
the design of complex systems. 

End users typically place stringent requirements on their tools, in terms of 
reliability, predictability, confidentiality, and usability. The construction of ap- 
plications that can meet these requirements in complex grid environments rep- 
resents a major research and engineering challenge. 

4.5 System Administrators 

The final group of users that we consider are the system administrators who 
must manage the infrastructure on which computational grids operate. This 
task is complicated by the high degree of sharing that grids are designed to 
make possible. The user communities and resources associated with a particu- 
lar grid will frequently span multiple administrative domains, and new services 
will arise — such as accounting and resource brokering — that require distributed 
management. Furthermore, individual resources may participate in several dif- 
ferent grids, each with its own particular user community, access policies, and so 
on. For a grid to be effective, each participating resource must be administered 
so as to strike an appropriate balance between local policy requirements and 
the needs of the larger grid community. This problem has a significant political 
dimension, but new technical solutions are also required. 

The Internet experience suggests that two keys to scalability when adminis- 
tering large distributed systems are to decentralize administration and to auto- 
mate trans-site issues. For example, names and routes are administered locally, 
while essential trans-site services such as route discovery and name resolution 
are automated. Grids will require a new generation of tools for automatically 
monitoring and managing many tasks that are currently handled manually. 
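
The combination of local administration and automated trans-site services can be illustrated with a toy naming hierarchy. The sketch below is a simplified model under stated assumptions, not the actual Domain Name System protocol: the `Site` class, the `resolve` function, and the host names and address are all hypothetical.

```python
from typing import Dict, Optional


class Site:
    """A site that administers its own names and delegates subdomains to child sites."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.hosts: Dict[str, str] = {}        # locally administered host -> address
        self.children: Dict[str, "Site"] = {}  # delegated subdomains

    def delegate(self, label: str) -> "Site":
        """Hand responsibility for a subdomain to a new child site."""
        child = Site(label)
        self.children[label] = child
        return child


def resolve(root: Site, fqdn: str) -> Optional[str]:
    """Automated trans-site lookup: follow delegations from the root, right to left."""
    labels = fqdn.split(".")
    site: Optional[Site] = root
    # All labels except the first name a chain of delegated sites.
    for label in reversed(labels[1:]):
        site = site.children.get(label) if site else None
        if site is None:
            return None
    return site.hosts.get(labels[0]) if site else None


if __name__ == "__main__":
    root = Site("")
    university = root.delegate("edu").delegate("example")
    university.hosts["node1"] = "192.0.2.10"  # edited only by the local administrator
    print(resolve(root, "node1.example.edu"))  # 192.0.2.10
```

Each site edits only its own table, while the lookup that crosses site boundaries needs no human intervention.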

New administration issues that arise in grids include establishing, monitor- 
ing, and enforcing local policies in situations where the set of users may be 
large and dynamic; negotiating policy with other sites and users; accounting 
and payment mechanisms; and the establishment and management of markets 
and other resource-trading mechanisms. There are interesting parallels between 
these problems and management issues that arise in the electric power and banking 
industries ([14], [31], [28]). 

5 Grid Architecture 

What is involved in building a grid? To address this question, we adopt a system 
architect’s perspective and examine the organization of the software infrastruc- 
ture required to support the grid users, applications, and services discussed in 
the preceding sections. 

As noted above, computational grids will be created to serve different com- 
munities with widely varying characteristics and requirements. Hence, it seems 
unlikely that we will see a single grid architecture. However, we do believe that 
we can identify basic services that most grids will provide, with different grids 
adopting different approaches to the realization of these services. 

One major driver for the techniques used to implement grid services is scale. 
Computational infrastructure, like other infrastructures, is fractal, or self-similar 
at different scales. We have networks between countries, organizations, clusters, 
and computers; between components of a computer; and even within a single 
component. However, at different scales, we often operate in different physical, 
economic, and political regimes. For example, the access control solutions used 
for a laptop computer’s system bus are probably not appropriate for a trans- 
pacific cable. 

In this section, we adopt scale as the major dimension for comparison. We 
consider four types of systems, of increasing scale and complexity, asking two 
questions for each: What new concerns does this increase in scale introduce? 
And how do these new concerns influence how we provide basic services? These 
system types are as follows (see also Table 4): 

1. The end system provides the best model we have for what it means to com- 
pute, because it is here that most research and development efforts have 
focused in the past four decades. 

2. The cluster introduces new issues of parallelism and distributed manage- 
ment, albeit of homogeneous systems. 

3. The intranet introduces the additional issues of heterogeneity and geograph- 
ical distribution. 

4. The internet introduces issues associated with a lack of centralized control. 



An important secondary driver for architectural solutions is the performance 
requirements of the grid. Stringent performance requirements amplify the effect 
of scale because they make it harder to hide heterogeneity. For example, if per- 
formance is not a big concern, it is straightforward to extend UNIX file I/O 
to support access to remote files, perhaps via a Hypertext Transfer Protocol 
(HTTP) gateway [65]. However, if performance is critical, remote access may 
require quite different mechanisms — such as parallel transfers over a striped net- 
work from a remote parallel file system to a local parallel computer — that are not 
easily expressed in terms of UNIX file I/O semantics. Hence, a high-performance 
wide area grid may need to adopt quite different solutions to data access prob- 
lems. In the following, we assume that we are dealing with high-performance 
systems; systems with lower performance requirements are generally simpler. 
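
To indicate why striped transfers do not map onto a single sequential file stream, the sketch below deals fixed-size blocks of a byte string across several streams and then reassembles them. It is an in-memory illustration only: the function names `stripe` and `reassemble` are hypothetical, and no real network or parallel file system is involved.

```python
from typing import List


def stripe(data: bytes, streams: int, block: int) -> List[List[bytes]]:
    """Deal fixed-size blocks of data round-robin across a number of streams."""
    if streams < 1 or block < 1:
        raise ValueError("streams and block must be positive")
    lanes: List[List[bytes]] = [[] for _ in range(streams)]
    for index, offset in enumerate(range(0, len(data), block)):
        lanes[index % streams].append(data[offset:offset + block])
    return lanes


def reassemble(lanes: List[List[bytes]]) -> bytes:
    """Interleave the blocks from each stream back into the original order."""
    out = bytearray()
    longest = max((len(lane) for lane in lanes), default=0)
    for i in range(longest):
        for lane in lanes:
            if i < len(lane):
                out += lane[i]
    return bytes(out)


if __name__ == "__main__":
    lanes = stripe(b"abcdefghij", streams=3, block=2)
    print(lanes)              # [[b'ab', b'gh'], [b'cd', b'ij'], [b'ef']]
    print(reassemble(lanes))  # b'abcdefghij'
```

Each stream can be moved independently and concurrently, but the receiver must know the striping layout to restore the data, information that ordinary sequential read and write calls do not carry.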



5.1 Basic Services 

Table 4. Computer systems operating at different scales. 

Endsystem 
    Computer model:       Multithreading, automatic parallelization 
    I/O model:            Local I/O, disk-striping 
    Resource management:  Process creation, OS signal delivery, OS scheduling 
    Security:             OS kernel, hardware 

Cluster (increased scale, reduced integration) 
    Computer model:       Synchronous communication, distributed shared memory 
    I/O model:            Parallel I/O (e.g., MPI-IO), file systems 
    Resource management:  Parallel process creation, gang scheduling, OS-level 
                          signal propagation 
    Security:             Shared security databases 

Intranet (heterogeneity, separate administration, lack of global knowledge) 
    Computer model:       Client/server, loosely synchronous: pipelines, 
                          coupling, manager/worker 
    I/O model:            Distributed file systems (DFS, HPSS), databases 
    Resource management:  Resource discovery, signal distribution networks, 
                          high throughput 
    Security:             Network security (Kerberos) 

Internet (lack of centralized control, geographical distribution, intl. issues) 
    Computer model:       Collaborative systems, remote control, data mining 
    I/O model:            Remote file access, digital libraries, data warehouses 
    Resource management:  Brokers, trading, mobile code, negotiation 
    Security:             Trust delegation, public key, sandboxes 

We start our discussion of architecture by reviewing the basic services provided 
on conventional computers. We do so because we believe that, in the absence of 
strong evidence to the contrary, services that have been developed and proven 
effective in several decades of conventional computing will also be desirable in 
computational grids. Grid environments also require additional services, but we 
claim that, to a significant extent, grid development will be concerned with 
extending familiar capabilities to the more complex wide area environment. 

Our purpose in this subsection is not to provide a detailed exposition of well- 
known ideas but rather to establish a vocabulary for subsequent discussion. We 
assume that we are discussing a generic modern computing system, and hence 
refrain from prefixing each statement with “in general,” “typically,” and the like. 
Individual systems will, of course, differ from the generic systems described here, 
sometimes in interesting and important ways. 

The first step in a computation that involves shared resources is an authen- 
tication process, designed to establish the identity of the user. A subsequent 
authorization process establishes the right of the user to create entities called 
processes. A process comprises one or more threads of control, created for ei- 
ther concurrency or parallelism, and executing within a shared address space. A 
process can also communicate with other processes via a variety of abstractions, 
including shared memory (with semaphores or locks), pipes, and protocols such 
as TCP/IP. 
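
The relationship between threads and a shared address space can be sketched in a few lines of Python. This is an illustration only: the names counter, add_many, and run_workers are invented for the example, and a lock stands in for the "semaphores or locks" mentioned above.

```python
import threading

counter = 0                      # lives in the process's single address space
counter_lock = threading.Lock()  # guards updates to the shared variable


def add_many(n: int) -> None:
    """Increment the shared counter n times, one locked update at a time."""
    global counter
    for _ in range(n):
        with counter_lock:
            counter += 1


def run_workers(num_threads: int, increments: int) -> int:
    """Start the threads, wait for all of them, and return the final count."""
    global counter
    counter = 0
    threads = [threading.Thread(target=add_many, args=(increments,))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter
```

Because every thread sees the same variable, no data is copied between them; the lock exists only to make each update indivisible.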

A user (or process acting on behalf of a user) can control the activities in 
another process — for example, to suspend, resume, or terminate its execution. 
This control is achieved by means of asynchronously delivered signals. 
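
As a minimal sketch of such control, assuming a POSIX system, the following Python fragment suspends, resumes, and then terminates a child process using only signals. The function name and the sleeping child are hypothetical choices made for the example.

```python
import signal
import subprocess
import sys


def suspend_resume_terminate() -> int:
    """Start a child process, then control it purely through signals."""
    # The child only sleeps; it never cooperates with its controller.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    child.send_signal(signal.SIGSTOP)  # suspend execution
    child.send_signal(signal.SIGCONT)  # resume execution
    child.send_signal(signal.SIGTERM)  # terminate
    # On POSIX, a negative return code names the signal that ended the child.
    return child.wait(timeout=10)
```

The signals are delivered asynchronously: the child contains no code that polls for requests from its controller.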

A process acts on behalf of its creator to acquire resources, by executing in- 
structions, occupying memory, reading and writing disks, sending and receiving 
messages, and so on. The ability of a process to acquire resources is limited
by underlying authorization mechanisms, which implement a system’s resource
allocation policy, taking into account the user’s identity, prior resource consump- 
tion, and/or other criteria. Scheduling mechanisms in the underlying system deal 
with competing demands for resources and may also (for example, in realtime 
systems) support user requests for performance guarantees. 

Underlying accounting mechanisms keep track of resource allocations and 
consumption, and payment mechanisms may be provided to translate resource 
consumption into some common currency. The underlying system will also pro- 
vide protection mechanisms to ensure that one user’s computation does not in- 
terfere with another’s. 

Other services provide abstractions for secondary storage. Of these, virtual 
memory is implicit, extending the shared address space abstraction already 
noted; file systems and databases are more explicit representations of secondary 
storage. 

5.2 End Systems 

Individual end systems — computers, storage systems, sensors, and other de- 
vices — are characterized by relatively small scale and a high degree of homo- 
geneity and integration. There are typically just a few tens of components (pro- 
cessors, disks, etc.); these components are mostly of the same type, and the com- 
ponents and the software that controls them have been co-designed to simplify 
management and use and to maximize performance. (Specialized devices such
as scientific instruments may be significantly more complex, with potentially
thousands of internal components, of which hundreds may be visible externally.)

Such end systems represent the simplest, and most intensively studied, envi- 
ronment in which to provide the services listed above. The principal challenges 
facing developers of future systems of this type relate to changing computer ar- 
chitectures (in particular, parallel architectures) and the need to integrate end 
systems more fully into clusters, intranets, and internets. 



State of the Art The software architectures used in conventional end systems 
are well known [60]. Basic services are provided by a privileged operating system,
which has absolute control over the resources of the computer. This operating 
system handles authentication and mediates user process requests to acquire 
resources, communicate with other processes, access files, and so on. The inte- 
grated nature of the hardware and operating system allows high-performance 
implementations of important functions such as virtual memory and I/O. 

Programmers develop applications for these end systems by using a variety of 
high-level languages and tools. A high degree of integration between processor 
architecture, memory system, and compiler means that high performance can 
often be achieved with relatively little programmer effort. 



Future Directions A significant deficiency of most end-system architectures 
is that they lack features necessary for integration into larger clusters, intranets,
and internets. Much current research and development is concerned with evolving
end-system architectures in directions relevant to future computational grids.
To list just three: Operating systems are evolving to support operation in clus- 
tered environments, in which services are distributed over multiple networked 
computers, rather than replicated on every processor [3], [64]. A second impor- 
tant trend is toward a greater integration of end systems (computers, disks, etc.) 
with networks, with the goal of reducing the overheads incurred at network in- 
terfaces and hence increasing communication rates [22], [35]. Finally, support for 
mobile code is starting to appear, in the form of authentication schemes, secure 
execution environments for downloaded code (“sandboxes”), and so on [32], [71], 
[70], [43].

The net effect of these various developments seems likely to be to reduce the 
currently sharp boundaries between end system, cluster, and intranet/internet,
with the result that individual end systems will more fully embrace remote com- 
putation, as producers and/or consumers. 



5.3 Clusters 

The second class of systems that we consider is the cluster, or network of worksta- 
tions: a collection of computers connected by a high-speed local area network and 
designed to be used as an integrated computing or data processing resource. A 
cluster, like an individual end system, is a homogeneous entity (its constituent
systems differ primarily in configuration, not basic architecture) and is
controlled by a single administrative entity that has complete control over each end
system. The two principal complicating factors that the cluster introduces are 
as follows: 

1. Increased physical scale: A cluster may comprise several hundred or thousand 
processors, with the result that alternative algorithms are needed for certain 
resource management and control functions. 

2. Reduced integration: A desire to construct clusters from commodity parts
means that clusters are often less integrated than end systems. One implication
of this is reduced performance for certain functions (e.g., communication).



State of the Art The increased scale and reduced integration of the cluster 
environment make the implementation of certain services more difficult and also 
introduce a need for new services not required in a single end system. The 
result tends to be either significantly reduced performance (and hence range 
of applications) or software architectures that modify and/or extend end-system 
operating systems in significant ways. 

We use the problem of high-performance parallel execution to illustrate the 
types of issues that can arise when we seek to provide familiar end-system ser- 
vices in a cluster environment. In a single (multiprocessor) end system, high- 
performance parallel execution is typically achieved either by using specialized
communication libraries such as MPI or by creating multiple threads that
communicate by reading and writing a shared address space.

Both message-passing and shared-memory programming models can be im- 
plemented in a cluster. Message passing is straightforward to implement, since 
the commodity systems from which clusters are constructed typically support at 
least TCP/IP as a communication protocol. Shared memory requires additional 
effort: in an end system, hardware mechanisms ensure a uniform address space 
for all threads, but in a cluster, we are dealing with multiple address spaces. 
One approach to this problem is to implement a logical shared memory by pro- 
viding software mechanisms for translating between local and global addresses, 
ensuring coherency between different versions of data, and so forth. A variety of 
such distributed shared-memory systems exist, varying according to the level at 
which sharing is permitted [75], [24], [52]. 
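
The address-translation part of this approach can be sketched as follows. This is a toy model under stated assumptions, not the design of any system cited above: pages are dealt out to nodes round-robin, each global address has exactly one home, and no caching or coherence protocol is modeled. PAGE_SIZE, Node, and LogicalSharedMemory are hypothetical names.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 4096  # bytes per page; an illustrative value


@dataclass
class Node:
    """One cluster member with its own private address space."""
    memory: dict = field(default_factory=dict)  # local offset -> value


class LogicalSharedMemory:
    """Maps one global address range onto the memories of several nodes."""

    def __init__(self, num_nodes: int) -> None:
        self.nodes = [Node() for _ in range(num_nodes)]

    def translate(self, global_addr: int) -> tuple:
        """Return (owning node, local offset) for a global address."""
        page, offset = divmod(global_addr, PAGE_SIZE)
        owner = page % len(self.nodes)        # round-robin page placement
        local_page = page // len(self.nodes)  # position within the owner
        return owner, local_page * PAGE_SIZE + offset

    def write(self, global_addr: int, value: int) -> None:
        owner, local = self.translate(global_addr)
        self.nodes[owner].memory[local] = value

    def read(self, global_addr: int) -> int:
        owner, local = self.translate(global_addr)
        return self.nodes[owner].memory.get(local, 0)
```

A real distributed shared-memory system would also replicate or migrate data and keep the copies coherent; the sketch shows only how a single global name space can be layered over multiple local ones.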

In low-performance environments, the cluster developer’s job is done at this 
point; message-passing and distributed shared-memory (DSM) systems can be run as user-level programs
that use conventional communication protocols and mechanisms (e.g., TCP/IP) 
for interprocessor communication. However, if performance is important, con- 
siderable additional development effort may be required. Conventional network 
protocols are orders of magnitude slower than intra-end-system communication 
operations. Low-latency, high-bandwidth inter-end-system communication can 
require modifications to the protocols used for communication, the operating 
system’s treatment of network interfaces, or even the network interface hard- 
ware [69], [55]. 

The cluster developer who is concerned with parallel performance must also 
address the problem of coscheduling. There is little point in communicating 
extremely rapidly to a remote process that must be scheduled before it can 
respond. Coscheduling refers to techniques that seek to schedule simultaneously 
the processes constituting a computation on different processors [23], [62]. In 
certain highly integrated parallel computers, coscheduling is achieved by using a 
batch scheduler: processors are space shared, so that only one computation uses 
a processor at a time. Alternatively, the schedulers on the different systems can 
communicate, or the application itself can guide the local scheduling process to 
increase the likelihood that processes will be coscheduled [3], [14]. 
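
One simple way to picture coscheduling is as a table of time slices by processors in which all processes of a computation are placed in the same slice. The sketch below builds such a table with a first-fit rule. It is an illustration under our own assumptions (the function name, the first-fit policy, and the job format are not taken from the cited work).

```python
def gang_schedule(jobs: dict, num_processors: int) -> list:
    """Build a time-slice table in which each job's processes share one slice.

    jobs maps a job name to the number of processes it needs. The result is
    a list of slices; each slice is a list of length num_processors holding
    a job name, or None for an idle processor.
    """
    slices = []
    for name, width in jobs.items():
        if width > num_processors:
            raise ValueError(f"{name} needs more processors than exist")
        # First fit: reuse the earliest slice with enough idle processors.
        for row in slices:
            if row.count(None) >= width:
                break
        else:
            row = [None] * num_processors
            slices.append(row)
        placed = 0
        for cpu in range(num_processors):
            if row[cpu] is None and placed < width:
                row[cpu] = name
                placed += 1
    return slices
```

For example, with four processors and jobs A, B, and C needing 3, 2, and 1 processes, A and C share the first slice and B runs alone in the second, so every process of a job is runnable whenever its peers are.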

To summarize the points illustrated by this example: in clusters, the im- 
plementation of services taken for granted in end systems can require new ap- 
proaches to the implementation of existing services (e.g., interprocess commu- 
nication) and the development of new services (e.g., DSM and coscheduling). 
The complexity of the new approaches and services, as well as the number of 
modifications required to the commodity technologies from which clusters are 
constructed, tends to increase proportionally with performance requirements. 

We can paint a similar picture in other areas, such as process creation, process 
control, and I/O. Experience shows that familiar services can be extended to the 
cluster environment without too much difficulty, especially if performance is not 
critical; the more sophisticated cluster systems provide transparent mechanisms 
for allocating resources, creating processes, controlling processes, accessing files,
and so forth, that work regardless of a program’s location within the cluster.
However, when performance is critical, new implementation techniques, low-level 
services, and high-level interfaces can be required [64], [25]. 



Future Directions Cluster architectures are evolving in response to three pres- 
sures: 

1. Performance requirements motivate increased integration and hence operating
system and hardware modifications (for example, to support fast communications).

2. Changed operational parameters introduce a need for new operating system 
and user-level services, such as coscheduling. 

3. Economic pressures encourage a continued focus on commodity technologies, 
at the expense of decreased integration and hence performance and services. 

It seems likely that, in the medium term, software architectures for clusters 
will converge with those for end systems, as end-system architectures address 
issues of network operation and scale. 



5.4 Intranets 

The third class of systems that we consider is the intranet, a grid comprising a 
potentially large number of resources that nevertheless belong to a single organi- 
zation. Like a cluster, an intranet can assume centralized administrative control 
and hence a high degree of coordination among resources. The three principal 
complicating factors that an intranet introduces are as follows: 

1. Heterogeneity: The end systems and networks used in an intranet are al- 
most certainly of different types and capabilities. We cannot assume a single 
system image across all end systems. 

2. Separate administration: Individual systems will be separately administered; 
this feature introduces additional heterogeneity and the need to negotiate 
potentially conflicting policies. 

3. Lack of global knowledge: A consequence of the first two factors, and the 
increased number of end systems, is that it is not possible, in general, for 
any one person or computation to have accurate global knowledge of system 
structure or state. 



State of the Art The software technologies employed in intranets focus pri- 
marily on the problems of physical and administrative heterogeneity. The result 
is typically a simpler, less tightly integrated set of services than in a typical 
cluster. Commonly, the services that are provided are concerned primarily with 
the sharing of data (e.g., distributed file systems, databases, Web services) or 
with providing access to specialized services, rather than with supporting the
coordinated use of multiple resources. Access to nonlocal resources often requires
the use of simple, high-level interfaces designed for “arm’s-length” operation
in environments in which every operation may involve authentication, format 
conversions, error checking, and accounting. Nevertheless, centralized adminis- 
trative control does mean that a certain degree of uniformity of mechanism and 
interface can be achieved; for example, all machines may be required to run a 
specific distributed file system or batch scheduler, or may be placed behind a 
firewall, hence simplifying security solutions. 

Software architectures commonly used in intranets include the Distributed 
Computing Environment (DCE), DCOM, and CORBA. In these systems, pro- 
grams typically do not allocate resources and create processes explicitly, but 
rather connect to established “services” that encapsulate hardware resources or 
provide defined computational services. Interactions occur via remote procedure 
call [33] or remote method invocation [54], [36], models designed for situations 
in which the parties involved have little knowledge of each other. Communica- 
tions occur via standardized protocols (typically layered on TCP/IP) that are 
designed for portability rather than high performance. In larger intranets, partic- 
ularly those used for mission-critical applications, reliable group communication 
protocols such as those implemented by ISIS [7] and Totem [45] can be used to 
deal with failure by ordering the occurrence of events within the system. 

The limited centralized control provided by a parent organization can allow 
the deployment of distributed queuing systems such as Load Sharing Facility 
(LSF), Codine, or Condor, hence providing uniform access to compute resources. 
Such systems provide some support for remote management of computation, for 
example, by distributing a limited range of signals to processes through local 
servers and a logical signal distribution network. However, issues of security, 
payment mechanisms, and policy often prevent these solutions from scaling to 
large intranets. 

In a similar fashion, uniform access to data resources can be provided by 
means of wide area file system technology (such as DFS), distributed database 
technology, or remote database access (such as SQL servers). High-performance, 
parallel access to data resources can be provided by more specialized systems 
such as the High Performance Storage System [72]. In these cases, the interfaces 
presented to the application would be the same as those provided in the cluster 
environment. 

The greater heterogeneity, scale, and distribution of the intranet environment 
also introduce the need for services that are not needed in clusters. For exam- 
ple, resource discovery mechanisms may be needed to support the discovery of 
the name, location, and other characteristics of resources currently available on 
the network. A reduced level of trust and greater exposure to external threats 
may motivate the use of more sophisticated security technologies. Here, we can 
once again exploit the limited centralized control that a parent organization can 
offer. Solutions such as Kerberos [50] can be mandated and integrated into the 
computational model, providing a unified authentication structure throughout 
the intranet. 
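
A resource discovery mechanism of the kind described above can be pictured as a directory that resources register with and that clients query by attribute. The following in-memory sketch is purely illustrative; ResourceRecord, ResourceDirectory, and the attributes chosen are hypothetical, and a deployed service would be distributed and would have to cope with stale entries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceRecord:
    """What one resource advertises about itself."""
    name: str
    location: str
    kind: str      # for example "compute" or "storage"
    cpus: int = 0


class ResourceDirectory:
    """In-memory stand-in for a resource discovery service."""

    def __init__(self) -> None:
        self._records = {}

    def register(self, record: ResourceRecord) -> None:
        self._records[record.name] = record

    def withdraw(self, name: str) -> None:
        self._records.pop(name, None)

    def find(self, kind: str, min_cpus: int = 0) -> list:
        """Return matching records, sorted by name for a stable order."""
        matches = [r for r in self._records.values()
                   if r.kind == kind and r.cpus >= min_cpus]
        return sorted(matches, key=lambda r: r.name)
```

A client asks for characteristics ("a compute resource with at least 8 processors") rather than for a machine by name, which is what makes the mechanism useful when no one has complete knowledge of the network.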



Future Directions Existing intranet technologies do a reasonable job of
projecting a subset of familiar programming models and services (procedure calls,
file systems, etc.) into an environment of greater complexity and physical scale, 
but are inadequate for performance-driven applications. We expect future de- 
velopments to overcome these difficulties by extending lighter-weight interac- 
tion models originally developed within clusters into the more complex intranet 
environment, and by developing specialized performance-oriented interfaces to 
various services. 

5.5 Internets 

The final class of systems that we consider is also the most challenging on which 
to perform network computing — internetworked systems that span multiple or- 
ganizations. Like intranets, internets tend to be large and heterogeneous. The 
three principal additional complicating factors that an internet introduces are 
as follows: 

1. Lack of centralized control: There is no central authority to enforce opera- 
tional policies or to ensure resource quality, and so we see wide variation in 
both policy and quality. 

2. Geographical distribution: Internets typically link resources that are geo- 
graphically widely distributed. This distribution leads to network perfor- 
mance characteristics significantly different from those in local area or met- 
ropolitan area networks of clusters and intranets. Not only does latency scale 
linearly with distance, but bisection bandwidth arguments [18], [26] suggest 
that accessible bandwidth tends to decline linearly with distance, as a result 
of increased competition for long-haul links. 

3. International issues: If a grid extends across international borders, export 
controls may constrain the technologies that can be used for security, and so 
on. 
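
The linear growth of latency with distance noted in item 2 follows from propagation delay alone. The sketch below computes that lower bound, assuming a signal speed of roughly 200,000 km/s (about two-thirds of the speed of light, a common approximation for optical fiber). The constant and function name are ours, and real round-trip times are higher because of routing, queuing, and protocol overheads.

```python
# Assumed propagation speed in fiber, roughly two-thirds of c.
SIGNAL_SPEED_KM_PER_S = 200_000.0


def min_round_trip_ms(distance_km: float) -> float:
    """Lower bound on round-trip time from propagation delay alone."""
    one_way_s = distance_km / SIGNAL_SPEED_KM_PER_S
    return 2.0 * one_way_s * 1000.0
```

Under this assumption, two sites 5,000 km apart cannot exchange a request and a reply in less than about 50 ms, however fast the end systems and links are.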



State of the Art The internet environment’s scale and lack of central con- 
trol have so far prevented the successful widespread deployment of grid services. 
Approaches that are effective in intranets often break down because of the in- 
creased scale and lack of centralized management. The set of assumptions that 
one user or resource can make about another is reduced yet further, a situation 
that can lead to a need for implementation techniques based on discovery and 
negotiation. 

We use two examples to show how the internet environment can require new 
approaches. We first consider security. In an intranet, it can be reasonable to as- 
sume that every user has a preestablished trust relationship with every resource 
that he wishes to access. In the more open internet environment, this assumption 
becomes intractable because of the sheer number of potential process-to-resource 
relationships. This problem is accentuated by the dynamic and transient nature 
of computation, which makes any explicit representation of these relationships
infeasible. Free-flowing interaction between computations and resources requires
more dynamic approaches to authentication and access control. One potential 
solution is to introduce the notion of delegation of trust into security relation- 
ships; that is, we introduce mechanisms that allow an organization A to trust a 
user U because user U is trusted by a second organization B, with which A has 
a formal relationship. However, the development of such mechanisms remains a 
research problem. 
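
The delegation rule just described can be written down directly. The sketch below is a deliberately naive model: the Organization class, its members and partners sets, and the single-step rule are assumptions made for illustration, and the sketch ignores credentials, revocation, and every other element that makes the real problem hard.

```python
class Organization:
    """Toy model of an organization that can trust users directly or by proxy."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.members = set()   # users this organization vouches for itself
        self.partners = set()  # organizations with a formal relationship

    def trusts(self, user: str) -> bool:
        """Direct trust, or one step of delegation through a partner."""
        if user in self.members:
            return True
        return any(user in partner.members for partner in self.partners)
```

With this rule, A needs a relationship with B only, not with each of B's users. Delegation here is intentionally one step deep: A does not automatically trust users vouched for by B's own partners.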

As a second example, we consider the problem of coscheduling. In an intranet, 
it can be reasonable to assume that all resources run a single scheduler, whether 
a commercial system such as LSF or a research system such as Condor. Hence, 
it may be feasible to provide coscheduling facilities in support of applications 
that need to run on multiple resources at once. In an internet, we cannot rely 
on the existence of a common scheduling infrastructure. In this environment, 
coscheduling requires that a grid application (or scheduling service acting for 
an application) obtain knowledge of the scheduling policies that apply on dif- 
ferent resources and influence the schedule either directly through an external 
scheduling API or indirectly via some other means [16]. 
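
To make the idea concrete, suppose each resource can at least report the intervals during which it is free. An application-level scheduler could then look for the earliest time at which all of them are free together. The sketch below does exactly that; the interval format and the function name are assumptions for illustration and do not correspond to any particular scheduler's interface.

```python
def earliest_common_start(free_windows: dict, duration: int):
    """Return the earliest time at which every resource is free for duration.

    free_windows maps a resource name to a list of (start, end) intervals.
    Returns None when no common window is long enough.
    """
    # The earliest feasible start always coincides with the start of some
    # window, so only those times need to be tried.
    candidates = sorted(start
                        for windows in free_windows.values()
                        for start, _ in windows)
    for t in candidates:
        if all(any(start <= t and t + duration <= end
                   for start, end in windows)
               for windows in free_windows.values()):
            return t
    return None
```

In practice the reported windows may be inaccurate or may change before they can be claimed, which is one reason the text speaks of influencing, rather than dictating, the schedule.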



Future Directions Future development of grid technologies for internet envi- 
ronments will involve the development of more sophisticated grid services and 
the gradual evolution of the services provided at end systems in support of 
those services. There is little consensus on the shape of the grid architectures 
that will emerge as a result of this process, but both commercial technologies 
and research projects point to interesting potential directions. Three of these 
directions — commodity technologies, Legion, and Globus — are explored in de- 
tail in later chapters. We note their key characteristics here but avoid discussion 
of their relative merits. There is as yet too little experience in their use for such 
discussion to be meaningful. 

The commodity approach to grid architecture adopts as the basis for grid 
development the vast range of commodity technologies that are emerging at 
present, driven by the success of the Internet and Web and by the demands of 
electronic information delivery and commerce. These technologies are being used 
to construct three-tier architectures, in which middle-tier application servers me- 
diate between sophisticated back-end services and potentially simple front ends. 
Grid applications are supported in this environment by means of specialized 
high-performance back-end and application servers. 

The Legion approach to grid architecture seeks to use object-oriented design 
techniques to simplify the definition, deployment, application, and long-term 
evolution of grid components. Hence, the Legion architecture defines a complete 
object model that includes abstractions of compute resources called host objects, 
abstractions of storage systems called data vault objects, and a variety of other 
object classes. Users can use inheritance and other object-oriented techniques to 
customize the behavior of these objects to their own particular needs, as well as 
develop new objects. 
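
The flavor of this object-oriented approach can be suggested with a small sketch. The classes below are hypothetical and are not Legion's actual interfaces; they show only how a host abstraction and a storage abstraction might be defined, and how a site could use inheritance to customize the host's admission policy.

```python
class HostObject:
    """Hypothetical abstraction of a compute resource."""

    def __init__(self, name: str, cpus: int) -> None:
        self.name = name
        self.cpus = cpus

    def accepts(self, user: str, cpus_requested: int) -> bool:
        """Default policy: admit any request that fits on the host."""
        return cpus_requested <= self.cpus


class VaultObject:
    """Hypothetical abstraction of a storage system."""

    def __init__(self) -> None:
        self._items = {}

    def store(self, key: str, data: bytes) -> None:
        self._items[key] = data

    def load(self, key: str) -> bytes:
        return self._items[key]


class DepartmentHost(HostObject):
    """Customized host: only listed users may run here."""

    def __init__(self, name: str, cpus: int, allowed_users) -> None:
        super().__init__(name, cpus)
        self.allowed_users = set(allowed_users)

    def accepts(self, user: str, cpus_requested: int) -> bool:
        return (user in self.allowed_users
                and super().accepts(user, cpus_requested))
```

Code written against HostObject continues to work with DepartmentHost, while the owning site retains control over who may use the machine.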

The Globus approach to grid architecture is based on two assumptions:

1. Grid architectures should provide basic services but not prescribe particular 
programming models or higher-level architectures. 

2. Grid applications require services beyond those provided by today’s com- 
modity technologies. 

Hence, the focus is on defining a “toolkit” of low-level services for security, 
communication, resource location, resource allocation, process management, and 
data access. These services are then used to implement higher-level services, 
tools, and programming models. 

In addition, hybrids of these different architectural approaches are possible 
and will almost certainly be addressed; for example, a commodity three-tier 
system might use Globus services for its back end. 

A wide range of other projects are exploring technologies of potential rele- 
vance to computational grids, for example, WebOS [66], Charlotte [6], UFO [2], 
ATLAS [5], Javelin [15], Popcorn [10], and Globe [68]. 

6 Research Challenges 

What problems must be solved to enable grid development? In preceding sec- 
tions, we outlined what we expect grids to look like and how we expect them 
to be used. In doing so, we tried to be as concrete as possible, with the goal of 
providing at least a plausible view of the future. However, there are certainly 
many challenges to be overcome before grids can be used as easily and flexibly as 
we have described. In this section, we summarize the nature of these challenges, 
most of which are discussed in much greater detail in the chapters that follow. 



6.1 The Nature of Applications 

Early metacomputing experiments provide useful clues regarding the nature of 
the applications that will motivate and drive early grid development. However, 
history also tells us that dramatic changes in capabilities such as those discussed 
here are likely to lead to radically new ways of using computers — ways as yet 
unimagined. Research is required to explore the bounds of what is possible, 
both within those scientific and engineering domains in which metacomputing 
has traditionally been applied, and in other areas such as business, art, and 
entertainment. 



6.2 Programming Models and Tools 

As noted in Section 4, grid environments will require a rethinking of existing 
programming models and, most likely, new thinking about novel models more 
suitable for the specific characteristics of grid applications and environments.
Within individual applications, new techniques are required for expressing
advanced algorithms, for mapping those algorithms onto complex grid architectures,
for translating user performance requirements into system resource
requirements, and for adapting to changes in underlying system structure and
state. Increased application and system complexity increases the importance 
of code reuse, and so techniques for the construction and composition of grid- 
enabled software components will be important. Another significant challenge 
is to provide tools that allow programmers to understand and explain program 
behavior and performance. 



6.3 System Architecture 

The software systems that support grid applications must satisfy a variety of 
potentially conflicting requirements. A need for broad deployment implies that 
these systems must be simple and place minimal demands on local sites. At the 
same time, the need to support a wide variety of complex, performance-sensitive
applications implies that these systems must provide a range of potentially so- 
phisticated services. Other complicating factors include the need for scalability 
and evolution to future systems and services. It seems likely that new approaches
to software architecture will be needed to meet these requirements, which
do not appear to be satisfied by existing Internet, distributed computing,
or parallel computing technologies.



6.4 Algorithms and Problem-Solving Methods 

Grid environments differ substantially from conventional uniprocessor and paral- 
lel computing systems in their performance, cost, reliability, and security charac- 
teristics. These new characteristics will undoubtedly motivate the development 
of new classes of problem-solving methods and algorithms. Latency-tolerant and 
fault-tolerant solution strategies represent one important area in which research 
is required [5], [6], [10]. Highly concurrent and speculative execution techniques 
may be appropriate in environments where many more resources are available 
than at present. 



6.5 Resource Management 

A defining feature of computational grids is that they involve sharing of networks, 
computers, and other resources. This sharing introduces challenging resource 
management problems that are beyond the state of the art in a variety of areas. 
Many of the applications described in later chapters need to meet stringent 
end-to-end performance requirements across multiple computational resources 
connected by heterogeneous, shared networks. To meet these requirements, we 
must provide improved methods for specifying application-level requirements, for 
translating these requirements into computational resources and network-level 
quality-of-service parameters, and for arbitrating between conflicting demands. 







6.6 Security 

Sharing also introduces challenging security problems. Traditional network secu- 
rity research has focused primarily on two-party client-server interactions with 
relatively low performance requirements. Grid applications frequently involve 
many more entities, impose stringent performance requirements, and involve 
more complex activities such as collective operations and the downloading of 
code. In larger grids, issues that arise in electronic markets become important. 
Users may require assurance and licensing mechanisms that can provide guar- 
antees (backed by financial obligations) that services behave as advertised. 

6.7 Instrumentation and Performance Analysis 

The complexity of grid environments and the performance complexity of many 
grid applications make techniques for collecting, analyzing, and explaining per- 
formance data of critical importance. Depending on the application and comput- 
ing environment, poor performance as perceived by a user can be due to any one 
or a combination of many factors: an inappropriate algorithm, poor load balanc- 
ing, inappropriate choice of communication protocol, contention for resources, 
or a faulty router. Significant advances in instrumentation, measurement, and 
analysis are required if we are to be able to relate subtle performance problems in 
the complex environments of future grids to appropriate application and system 
characteristics. 

6.8 End Systems 

Grids also have implications for the end systems from which they are constructed. 
Today’s end systems are relatively small and are connected to networks by in- 
terfaces and with operating system mechanisms originally developed for reading 
and writing slow disks. Grids require that this model evolve in two dimensions. 
First, by increasing demand for high-performance networking, grid systems will 
motivate new approaches to operating system and network interface design in 
which networks are integrated with computers and operating systems at a more 
fundamental level than is the case today. Second, by developing new applica- 
tions for networked computers, grids will accelerate local integration and hence 
increase the size and complexity of the end systems from which they are con- 
structed. 

6.9 Network Protocols and Infrastructure 

Grid applications can be expected to have significant implications for future 
network protocols and hardware technologies. Mainstream developments in net- 
working, particularly in the Internet community, have focused on best-effort ser- 
vice for large numbers of relatively low-bandwidth flows. Many of the future grid 
applications discussed in this book require both high bandwidths and stringent 
performance assurances. Meeting these requirements will require major advances 
in the technologies used to transport, switch, route, and manage network flows.

7 Summary

This chapter has provided a high-level view of the expected purpose, shape, and 
architecture of future grid systems and, in the process, sketched a road map for 
more detailed technical discussion in subsequent chapters. The discussion was 
structured in terms of six questions. 

Why do we need computational grids? We explained how grids can enhance 
human creativity by, for example, increasing the aggregate and peak computa- 
tional performance available to important applications and allowing the coupling 
of geographically separated people and computers to support collaborative engi- 
neering. We also discussed how such applications motivate our requirement for 
a software and hardware infrastructure able to provide dependable, consistent, 
and pervasive access to high-end computational capabilities. 

What types of applications will grids be used for? We described five classes 
of grid applications: distributed supercomputing, in which many grid resources 
are used to solve very large problems; high throughput, in which grid resources 
are used to solve large numbers of small tasks; on demand, in which grids are 
used to meet peak needs for computational resources; data intensive, in which 
the focus is on coupling distributed data resources; and collaborative, in which 
grids are used to connect people. 

Who will use grids? We examined the shape and concerns of four grid com- 
munities, each supporting a different type of grid: a national grid, serving a 
national government; a private grid, serving a health maintenance organization; 
a virtual grid, serving a scientific collaboratory; and a public grid, supporting a 
market for computational services. 

How will grids be used? We analyzed the requirements of five classes of users 
for grid tools and services, distinguishing between the needs and concerns of 
end users, application developers, tool developers, grid developers, and system 
managers. 

What is involved in building a grid? We discussed potential approaches to 
grid architecture, distinguishing between the differing concerns that arise and 
technologies that have been developed within individual end systems, clusters, 
intranets, and internets. 

What problems must be solved to enable grid development? We provided a 
brief review of the research challenges that remain to be addressed before grids 
can be constructed and used on a large scale. 

Further Reading 

For more information on the topics covered in this chapter, see
www.mkp.com/grids and also the following references:

— A series of books published by the Corporation for National Research Ini- 
tiatives [29], [30], [31], [28] review and draw lessons from other large-scale 
infrastructures, such as the electric power grid, telecommunications network, 
and banking system. 

— Catlett and Smarr’s original paper on metacomputing [13] provides an early
vision of how high-performance distributed computing can change the way 
in which scientists and engineers use computing. 

— Papers in a 1996 special issue of the International Journal of Supercomputer
Applications [19] describe the architecture and selected applications of the
I-WAY metacomputing experiment.

— Papers in a 1997 special issue of the Communications of the ACM [61] de- 
scribe plans for a National Technology Grid. 

— Several reports by the National Research Council touch upon issues relevant 
to grids [48], [49], [47]. 

— Birman and van Renesse [8] discuss the challenges that we face in achieving 
reliability in grid applications. 

The Distributed Engineering Framework TENT

Abstract. The paper describes TENT, a component-based framework
for the integration of technical applications. TENT allows the engineer
to design, automate, control, and steer technical workflows interactively.
To this end, the applications are encapsulated as components that
conform to the TENT component architecture. The engineer
can combine the components into workflows in a graphical user interface.
The framework manages and controls a distributed workflow on arbi- 
trary computing resources within the network. Due to the utilization 
of CORBA, TENT supports all state-of-the-art programming languages, 
operating systems, and hardware architectures. It is designed to deal with 
parallel and sequential programming paradigms, as well as with massive 
data exchange. TENT is used for workflow integration in several projects, 
for CFD workflows in turbine engine and aircraft design, in the modeling 
of combustion chambers, and for virtual automobile prototyping. 



1 Introduction 

The design goal of TENT is the integration of all tools which belong to the
typical workflows in a computer aided engineering (CAE) environment. The
engineer should be able to configure, steer, and control his personal process
chain interactively. The workflow components can run on arbitrary computing
resources within the network. We achieved these goals by designing and
implementing a component-based framework for the integration of technical
applications.

TENT is used as an integration platform in several projects. In the SUPEA
project we developed a simulation environment for the analysis of
turbocomponents in gas turbines. An integrative environment for the development
of virtual automobile prototypes is under construction in the AUTOBENCH project
at GMD. In the AMANDA project, TENT is used for the integration of
multidisciplinary tools for the simulation of the static aeroelasticity of a
complete aircraft. Finally, in the BKM project several physical models for the
simulation of combustion chambers are coupled with the central simulation code
using TENT.

From the requirements of these projects we can derive the main requirements 
of the integration system: 

— Easy integration of existing CFD codes, finite element codes, and pre- and
postprocessors;

— Integration of sequential and parallel codes (MPI, PVM, or HPF); 

— Efficient data exchange between the tools; 

— Integration of tightly coupled codes; 

— Interactive control and design of the workflows;

— Interactive control of the simulation. 

1.1 Related Work 

The development of frameworks, integration environments, and problem solving
environments (PSE) is the focus of many scientific projects. The interactive
control of simulations in combination with virtual reality environments is
demonstrated within the collaborative visualization and simulation environment
COVISE [8]. Current projects concentrate on the utilization of CORBA [4] for
the organization of distributed systems. PARDIS [5] introduced an improved data
transfer mechanism for parallel applications by defining new parallel data
objects in CORBA. A very similar approach is pursued in the ParCo (parallel
CORBA) project [11]. The integration framework TENT is a CORBA-based
environment, which defines its own component architecture for a distributed
environment and a new data exchange interface for efficient parallel
communication on top of CORBA.




Fig. 1. Packages of the Integration System TENT 



2 Base Concepts of the Framework 

The TENT framework consists of four different packages. This structure is
displayed in figure 1. The software development kit comprises all interface
definitions and libraries for software development with TENT. The base system
includes all basic services needed to run a system integrated with TENT. The
facilities are a collection of high-level services, and the components consist
of wrappers and special services for the integration of applications as TENT
components.

2.1 The Software Development Kit 

TENT defines a component architecture and an application component interface
on top of the CORBA object model. The TENT component architecture is inspired
by the JavaBeans specification [3] and the Data Interchange and Synergistic
Collateral Usage Study framework (DISCUS) [15]. The interface hierarchy is
shown in figure 2. The data exchange interface, as part of the SDK, allows
parallel data exchange between two TENT components. It can invoke a data
converter, which automatically converts CFD data between the data formats
supported by the integrated tools.




Fig. 2. TENT Framework Interface Hierarchy (UML notation)



2.2 TENT Base System

The TENT system components form an engineering environment out of a set of
stand-alone tools. All system components are implemented in Java. The Master
Control Process (MCP) is the main entity within TENT. It realizes and controls
system tasks such as process chain management, construction and destruction of
components, and monitoring the state of the system.

The factories run on every machine in the TENT framework. They start, control,
and terminate the applications on their machine. The name service is the
standard CORBA service. The relations between the services are shown in
figure 3.

2.3 TENT Facilities 

In order to support high-level workflows, the system offers more sophisticated
facilities. For coupling multi-disciplinary simulation codes that work on
numerical grids, TENT will include a coupling server. This server offers the
full functionality of the coupling library MpCCI¹ (formerly CoColib) [1]. MpCCI
coordinates the exchange of boundary information between simulation
applications working on neighbouring grids. We are working on a data-management
service, which stores and organizes the simulation data of typical CFD or
finite element simulations. For the virtual automobile prototypes a data server
is already in production; it holds the data sets that allow a feedback loop
between a virtual reality environment, a crash simulator, and the appropriate
preprocessing tools.

Fig. 3. Architectural Overview of TENT



2.4 TENT Components 

The applications must be encapsulated by wrappers so that their functionality
can be accessed in TENT by CORBA calls. Depending on the accessibility of the
source code, the wrapper can be tightly coupled with the application, e.g.
linked together with it, or must be a stand-alone tool that starts the
application via system services. Depending on the wrapping mechanism, the
communication between the wrapper and the application is implemented by, e.g.,
direct memory access, IPC mechanisms, or file exchange.
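
The stand-alone variant of such a wrapper can be illustrated with a minimal sketch. It is not TENT code and involves no CORBA: the class name StandAloneWrapper, the JSON file format, and the tiny stand-in "solver" are all assumptions made for illustration. The sketch only shows the idea of starting an unmodified application through the operating system and exchanging data with it via files.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical stand-in for a legacy application: it reads a parameter file
# and writes a result file. A real wrapper would start an existing solver.
SOLVER_SOURCE = """
import json, sys
with open(sys.argv[1]) as f:
    params = json.load(f)
with open(sys.argv[2], "w") as f:
    json.dump({"sum": sum(params["values"])}, f)
"""


class StandAloneWrapper:
    """Starts an application as a separate process and talks to it via files."""

    def __init__(self, command):
        self.command = list(command)  # executable plus fixed arguments

    def run(self, params):
        with tempfile.TemporaryDirectory() as workdir:
            infile = Path(workdir) / "input.json"
            outfile = Path(workdir) / "output.json"
            infile.write_text(json.dumps(params))
            # Start the application via system services and wait for it.
            subprocess.run(self.command + [str(infile), str(outfile)], check=True)
            return json.loads(outfile.read_text())


wrapper = StandAloneWrapper([sys.executable, "-c", SOLVER_SOURCE])
result = wrapper.run({"values": [1, 2, 3]})
```

In a framework like the one described here, the run method would sit behind a remotely callable component interface; a tightly coupled wrapper would instead call into the application directly.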

¹ MpCCI is a trademark owned by GMD.

3 Applications

In several projects the TENT framework is used for the integration of engineering 
workflows. Originally TENT was developed in the SUPEA project for integrating 
simulation codes to build a numerical testbed for gas turbines. 




Fig. 4. TENT Control GUI with (from left to right, beginning in the upper left 
corner) Component Repository, Plot Control in the Property Panel, Component 
Controls, Wire Panel, and Logger Information 



In this project several preprocessors, simulation tools, postprocessors, and
visualization tools are integrated by TENT and can be combined to form
workflows. A typical SUPEA workflow is shown in the GUI snapshot in figure 4.
The workflow is displayed in the Wire Panel. Each icon represents one
application in the workflow. The wires between the icons describe the control
flow. The leftmost icon is the master control process, which starts and
controls the flow. It starts the simulation engine, which calculates the flow
state. After a user-defined number of iteration steps the simulation engine
sends the current physical data set to a postprocessor, which finally sends a
visualizable data file to the visualizer at the end of the workflow. After the
first run the workflow repeats as often as the user chooses in the bottom line
of the GUI.

The workflow can be freely designed by dragging the applications displayed in
the Component Repository and dropping them onto the Wire Panel. Here the
control flow is defined by wiring the icons interactively. Clicking on an icon
in the Wire Panel shows its editable properties in the Component Control of the
Property Panel in the upper right corner of the GUI. In figure 4 the Plot
Control is active, displaying a 2D plot of a freely selectable property of the
selected component.

In the AMANDA project TENT will be used for the integration of several
simulation, pre- and postprocessing, control, and visualization tools. The aim
is to form an environment for the multi-disciplinary aeroelastic simulation of
an aircraft in flight manoeuvres. Therefore the CFD simulation of the
surrounding flow must be coupled with a finite element analysis of the wing
structure. Logic components will be included in TENT to allow dynamic
workflows. The software integration aspect of the AMANDA project is described
in more detail in section 3.1.

The virtual automobile prototype is the aim of the project AUTOBENCH 
[13]. In this project TENT is used to allow the interactive control of a crash 
simulation from within a virtual reality environment. 



3.1 The AMANDA Workflows 

Airplane Design. For the simulation of a trimmed, freely flying, elastic air- 
plane the following programs have to be integrated into TENT: 

— a CFD code (TAU [12] or FLOWer [6]),

— a structural mechanics code (NASTRAN [10]) and a multi-body program
(SIMPACK [7]), to calculate the deformation of the structure,

— a coupling module, to control the coupled fluid-structure computation,

— a grid deformation module, to calculate a new grid for the CFD simulation,

— a flight mechanics/controls module (built using an object-oriented
modelling approach [9]), to set the aerodynamic control surface positions,

— visualization tools. 

Figure 5 shows a complete AMANDA airplane design workflow to be integrated
in TENT. In this case TAU is chosen for the CFD simulation. The process chain
is hierarchically structured. The lowest level contains the parallel CFD solver
only. This CFD subsystem consists of the flow solver itself and auxiliary
programs to handle mesh adaptation. The next level, the aeroelastic subsystem,
comprises the CFD solver, the structural mechanics or multi-body code, and the
grid deformation module. For the solution of the coupled non-linear aeroelastic
equations a staggered algorithm is implemented in the control process [2]. The
highest level consists of the flight mechanics module coupled to the
aeroelastic subsystem. Each single code is additionally accompanied by its own
visualization tool. The computation of a stable flight state typically proceeds
as follows. First, the flow around the airplane is calculated and the pressure
forces on the wings and the nacelle are derived. These forces are transferred
to the structural mechanics code and interpolated onto the structural mechanics
grid using the MpCCI library. This code calculates the deformation of the
structure, which in turn influences the flow field around the airplane. At a
final stage it is planned to extend the system and feed the calculated state
into a flight mechanics/controls module to set the control surfaces accordingly
and to obtain a stable flight position. This changes the geometry of the wings
and therefore requires the recalculation of the flow field and the deformation.

Fig. 5. Process chain for coupled CFD/structural mechanics/flight control
system.
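
The alternation between flow solution and structural deformation can be illustrated with a toy sketch. It does not reproduce the staggered algorithm of [2]: the two "solvers" are replaced by scalar textbook formulas (a wing section whose aerodynamic twisting moment grows with its elastic twist), and all function names and parameter values are assumptions chosen only to make the iteration visible.

```python
def aero_moment(q, twist, alpha0=0.05, lift_slope=5.0, area=1.0, arm=0.1):
    """Toy 'flow solver': twisting moment for the current deformation."""
    return q * area * arm * lift_slope * (alpha0 + twist)


def structural_twist(moment, stiffness=2.0):
    """Toy 'structural solver': elastic twist caused by the applied moment."""
    return moment / stiffness


def staggered_coupling(q, tol=1e-12, max_iter=1000):
    """Alternate both solvers until the deformation stops changing."""
    twist = 0.0
    for iteration in range(1, max_iter + 1):
        moment = aero_moment(q, twist)        # flow around the current shape
        new_twist = structural_twist(moment)  # deformation under that load
        if abs(new_twist - twist) < tol:
            return new_twist, iteration
        twist = new_twist
    raise RuntimeError("coupling iteration did not converge")


twist, iterations = staggered_coupling(q=1.0)
```

With these assumed values each pass reduces the error by a factor of 0.25, so the loop settles on the fixed point where load and deformation are consistent. In the workflow described above, each function call would correspond to a complete solver run, with the load transfer between the grids handled by the coupling library.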



Turbine Design. A new aspect in virtual turbine design is the simulation of
flow inside the turbine, taking into account the heat load on the blades and
the cooling. The numerical modeling of the position and size of cooling
channels and holes in the blades is essential for the layout of an air-cooled
turbine.

The CFD code TRACE [14], a 3D Navier-Stokes solver for the simulation of
steady and unsteady multistage turbomachinery applications, and a heat
conduction problem solver (NASTRAN) are coupled to simulate the airflow through
the turbine together with the temperature distribution inside the blades. For
the coupling of the flow simulation and the heat conduction a stable coupling
algorithm has been developed, in which TRACE delivers the temperatures of the
air surrounding the blade and the heat transfer coefficients as boundary
conditions to NASTRAN, which in turn calculates the temperature distribution
inside the blade. The temperature at the surface of the blade is returned to
TRACE as a boundary condition for the airflow.



4 Conclusions 

The integration framework TENT allows a high-level integration of engineering
applications and supporting tools in order to form static as well as
dynamically changeable workflows. TENT sets up, controls, and steers the
workflow on a distributed computing environment. It allows the integration of
parallel or sequential code in most common programming languages and on most
common operating systems. CORBA is used for communication and distributed
method invocation. Additionally, a fast parallel data exchange interface is
available. At the end of 1999 TENT will be available as freely distributed
software via the Internet. TENT is already used for integration in several
scientific engineering projects with an industrial background.

Abstract. The Virtual Reality Modeling Language, VRML97, allows
the description of dynamic worlds that are responsive to user interaction. 
However, the serial nature of current VRML browsers prevents the full 
potential of the language from being realised: they do not have the power 
to support huge, complex worlds with large numbers of interacting users. 
This paper presents the design of a scalable, parallel VRML server that 
has been built to overcome this barrier. The server distributes the task 
of storing and computing the changing state of the world across a cluster 
of workstations. Clients connect to the server and receive information on 
their current view of the world, which they can then render. The parallel 
server is implemented in Java, utilising a new active object model called 
SODA (System Of Dynamic Active Objects) that is also described in the 
paper. 



1 Introduction 

The Virtual Reality Modeling Language (VRML) [5] allows three-dimensional 
worlds to be described in a platform-independent notation. A world description 
can be downloaded over the Internet into a VRML browser, which performs the 
audio-visual presentation and provides a set of navigation paradigms to the user. 
The first version of VRML supported only static and non-interactive¹ worlds.
With the later VRML97 standard [6] “moving worlds” are supported, which can
be both dynamic over time and responsive to user interaction. It is therefore
possible to envisage the creation of huge, complex worlds with thousands of
interacting users. For example, models of cities could be built to include moving
vehicles as well as static buildings. In the future, with advances in traffic sensing
technology, it may even be possible to build models of real cities that show
accurate traffic flows in real-time.

¹ With the exception of “clickable” geometry, which can be used for linking to other
static worlds or hypertext documents by spawning an associated URL.

1.1 Problem Statement

Despite the obvious potential, VRML worlds available on the Internet have so far
been relatively limited in terms of scale, movement and interaction. The reason
can be found in the nature of the VRML usage model, in which the description
of the VRML world is downloaded into the user’s browser and evaluated locally.
This explains several limiting factors:

(1) Downloading the complete world description into the VRML browser takes 
a long time for large worlds. Once downloaded, the VRML browser main- 
tains dynamic world state, possibly generating huge memory demands on 
the user’s machine. 

(2) The user’s desktop machine must both continually update the state of the 
world — as objects move and the user interacts — and also render a view of 
the world in the browser window. If large, complex worlds require a greater 
processing power than available locally, the user will witness slow, jerky 
movements, resulting in a low-fidelity experience. 

(3) Because the world runs locally (in the user’s browser), there is no
possibility of interaction between different users in the same world. This
precludes both direct interaction (3a) between users who meet each other in
the world and indirect interaction (3b)².

(4) It is difficult to arrange for the world to change in response to external 
events. For example, if a virtual world models the current state of part of 
the real world (e.g., traffic flow in a city, footballers playing on a pitch) 
then we would wish to move the virtual world’s objects to reflect real-time 
changes to the objects in the real world. 

1.2 Related Work 

There have been several attempts to overcome a subset of these limiting factors. 

VNet [14] is a multi-user client-server system that enables shared VRML 
worlds. Each client runs a VRML browser and produces a message whenever 
the user changes position and orientation. These notification messages are re- 
layed through the server and broadcast to all other participants, which update 
the sender’s avatar position and orientation accordingly. This approach allows 
restricted direct interaction (3a) between multiple users. Other limiting factors 
(1, 2, 3b, 4) are not addressed. 

Several commercially available multi-user VRML implementations, like
Blaxxun Community Server [3] and Sony Community-Place [12], extend this idea
by providing shared events. Shared events allow for world state synchronisation,
which goes beyond mutual notification of avatar movements, enabling indirect
interaction (3b). In addition, some systems provide limited support for
server-driven “robots”, which inform all connected clients about their
behaviour. Still, the main world update responsibility rests with the clients;
factors (1) and (2) are not addressed.

ActiveWorlds [1] provides large-scale virtual environments scalable to
thousands of users. However, worlds in this client-server system are largely
static. Changes are infrequent and limited to the addition or removal of
objects from the world. Download time (1) is addressed through distance-based
filtering: the server only transmits information about (static) world segments
in the path of the viewer. (2) and (3) are not applicable here, as no
continuous world update takes place³. Therefore, scalability is achieved by
limiting world dynamics and the full generality of VRML97.

² For instance, when one user introduces/builds a structure that can be seen by
others. The VRML usage model does not allow any persistence.

All systems mentioned so far use a central server as a communication relay.
In contrast, SIMNET [13] and NPSNET [11] implement a peer-to-peer
architecture. The DIS-Java-VRML working group⁴ proposes a similar architecture
based on VRML97, Java and the Distributed Interactive Simulation (DIS) standard
[4]. World state in these systems is maintained and updated in a distributed
fashion. Each participant is responsible for evaluating a set of objects it
“owns”, and broadcasting state changes to all peers. (1) is solved by an object
repository, locally accessible to all participants. (2) is dealt with by
distributed world state update. Dead-reckoning techniques [10] are used to
minimise communication. However, there is no protection against malicious
participants delivering inconsistent information. Therefore, this approach is
unsuitable in an Internet setting with anonymous participants.



1.3 Our Approach 

This paper presents the design of a system that has been built with the aim of
directly addressing (1)-(4), and so supporting huge, complex worlds filled with
large numbers of interacting users. Our design provides a client-server
implementation of VRML, in which a server holds the state of the virtual world.
The state changes over time as objects move and users interact with the world. 
Many clients, each running a VRML browser, can connect to the one server and 
so share a single world, interacting with each other as required. When a client 
first connects to the server, it receives only the set of geometric objects that 
are visible from its initial starting position. This minimises download time. As 
the viewer moves and interacts with the world, it receives from the server up- 
dates to the position of any geometric objects in the field of vision that have 
changed, and also information on any geometric objects that have become visi- 
ble. Therefore, the client has little work to do other than render its view of the 
world. It makes sense to leave rendering to the client as current workstations 
have powerful graphics hardware dedicated to this task. 
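
The update policy described in this paragraph can be sketched as follows. This is not the server presented in the paper: the class names, the two-dimensional positions, the version counter, and the use of a fixed view radius as a stand-in for real visibility are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass, field


@dataclass
class WorldObject:
    name: str
    position: tuple   # (x, y) in world coordinates
    version: int = 0  # incremented whenever the object moves

    def move_to(self, position):
        self.position = position
        self.version += 1


@dataclass
class ClientSession:
    viewpoint: tuple
    view_radius: float
    known: dict = field(default_factory=dict)  # object name -> version last sent


class WorldServer:
    def __init__(self, objects):
        self.objects = {obj.name: obj for obj in objects}

    def visible(self, session):
        # Assumption: "visible" means within a fixed radius of the viewpoint.
        return [obj for obj in self.objects.values()
                if math.dist(obj.position, session.viewpoint) <= session.view_radius]

    def updates_for(self, session):
        """Return objects that became visible or changed since they were last sent."""
        updates = []
        for obj in self.visible(session):
            if session.known.get(obj.name) != obj.version:
                updates.append((obj.name, obj.position))
                session.known[obj.name] = obj.version
        return updates


server = WorldServer([WorldObject("car", (1.0, 0.0)), WorldObject("tower", (50.0, 0.0))])
client = ClientSession(viewpoint=(0.0, 0.0), view_radius=10.0)
first = server.updates_for(client)    # initial connection: only the nearby car
second = server.updates_for(client)   # nothing has changed
server.objects["car"].move_to((2.0, 0.0))
third = server.updates_for(client)    # the car moved while in view
client.viewpoint = (45.0, 0.0)
fourth = server.updates_for(client)   # the tower has come into view
```

The client never receives the distant tower until it moves close enough, which is the effect that keeps the initial download small.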

A consequence of this architecture is that when supporting complex systems
with many users, the server will have much work to do. We do not want to just
move the system bottleneck from the client to the server, and so we have designed
a scalable, parallel VRML server that allows the work of computing the state of
the world, and supporting clients, to be spread over a cluster of workstations. The
parallel server is implemented in Java, utilising a new, distributed active object
programming model and runtime system called SODA (System Of Dynamic
Active Objects).

³ Except for very simple, repetitive animations.

⁴ See http://www.web3d.org/WorkingGroups/vrtp/dis-java-vrml/

2 VRML Execution Model 

In this section we give a basic overview of the VRML97 execution model, focusing 
on those aspects that are important for the design of a parallel implementation. 

2.1 Basic Terminology 

A VRML world is described as an acyclic and directed scene graph populated by 
nodes of various types and defined in one or more textual files. The scene graph 
is hierarchically structured through grouping nodes, which may contain other 
nodes as descendants; a Transform node, for example, describes geometrical 
transformations that influence all descendant geometric nodes. Non-grouping 
nodes can only appear as leaves in the tree. 

VRML97 has 54 pre-defined node types, abstracting from various real-world
objects and concepts. They range from basic shapes and geometry, through
grouping nodes and light sources, to audio effects. Every node type stores its
state in one or more typed fields. Examples are a Transform node’s translation,
orientation and scaling fields, a Material’s colour and a SpotLight’s intensity.

Other nodes are responsible for driving and controlling the dynamic be- 
haviour of a scene, namely Sensor nodes, various Interpolator nodes and Script 
nodes. 

Sensor nodes are distinguished in that they are reactive to the passage of
time or to user interaction (e.g., “touching” of objects, user proximity, etc.). If
stimulated, a sensor node dispatches an event on one or more of its eventOut
fields (e.g., a TimeSensor can send an event at regular time intervals on its
cycleTime eventOut field). All events comprise a typed value and a timestamp,
which is determined by the sensor’s trigger time. Events can be propagated
from the producing eventOut field along routes to the eventIn fields of other
nodes. Upon receiving an event, nodes may change their state, perform event
processing or generate additional events. Routes are determined by the edges
of a directed routing graph that mediates one-way event notification between
nodes. The structure of this routing graph is completely orthogonal to the scene
graph hierarchy.

Event processing at a node can take the form of simple key-framed
interpolation, as is done by interpolator nodes. Script nodes are more powerful
in that they allow arbitrary, author-defined event processing and generation. A
world author can associate a Java or JavaScript function with each eventIn field.

2.2 Event Cascades 

During event processing, a node may not only change its state, but also generate
additional events. In this manner, a single sensor event can trigger an event
cascade involving a subset of the routing graph’s edges. All events in an event
cascade are considered to occur simultaneously and therefore carry the same
timestamp as the initial sensor event. To prevent infinite loops in a cyclic routing
graph, every eventOut is limited to at most one event per timestamp⁵.

Ideally, all events would be processed instantaneously in the order that they 
are generated. However, in a real-world implementation, there will always be pro- 
cessing delays. Furthermore, sensor nodes may generate events more frequently 
than the resulting event cascades can be evaluated. VRML97 addresses this is- 
sue by requiring implementations to evaluate events in timestamp order. This 
ensures that implementations produce deterministic results. 

Multiple eventOuts may route to the same eventIn in a fan-in configuration.
If events with the same timestamp arrive, they “shall be processed, but the order
of evaluation is implementation dependent.” ([6], paragraph 4.10.5)

2.3 Discrete and Continuous Events 

Most events produced during world execution are discrete: they happen at
well-defined world times, e.g. as determined by the time of user interaction.
However, TimeSensor nodes also have the capability to model continuous changes
over time: a browser generates sampling events on the fraction_changed and time
eventOut fields⁶ of TimeSensors. The sampling frequency is implementation
dependent, but typically samples would be produced once per frame, e.g., once
for every rendering of the user’s view of the world.

Additionally, VRML requires continuous changes to be up-to-date during the 
processing of discrete events, i.e., “continuous changes that are occurring at the 
discrete event’s timestamp shall behave as if they generate events at that same 
timestamp” ([6], paragraph 4.11.3.). 

Example 1 Figure 1(a) depicts a simple event cascade. The TouchSensor’s
isOver eventOut sends <true, touchTime> when the user moves the pointing
device over its geometry and <false, retractTime> upon retraction.
These events are routed to a Script node (amongst other destinations in a
fan-out configuration), which performs author-defined event processing. In
this example, the processing results in a colour value being sent to a
Material node. A world author might employ such a scenario to provide user
feedback for the touch of a button.

Example 2 The TimeSensor in figure 1(b) produces continuous events containing
a number in the range [0, 1] on its fraction_changed field with the passage
of time. These continuous events are passed to a PositionInterpolator that
animates the translation vector of a Transform node. In this way, VRML
provides support for linear key-framed animation. A fan-in situation can
arise for the Transform node if both PositionInterpolators send events with
identical timestamps.

⁵ Called loop breaking rule in VRML ([6], paragraph 4.10.4.)

⁶ fraction_changed describes the completed fraction of the current cycle as a float
value in the interval [0, 1]; time sends the absolute time in seconds since Jan 1, 1970,
00:00:00 GMT as a floating-point value.

(a) Discrete Initial Event

(b) Continuous Initial Events driving an Animation

Fig. 1. Simple Event Cascades for Different Sensor Events (circles depict
different field types: filled ↔ eventOut, empty ↔ eventIn, semi-filled ↔
exposedField)
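
The linear key-framed animation of Example 2 can be sketched in a few lines. The function below is only an illustration of piecewise-linear interpolation between key positions; the function name, the argument names, and the sample key frames are assumptions and are not taken from the paper.

```python
from bisect import bisect_right


def interpolate_position(fraction, keys, key_values):
    """Piecewise-linear interpolation of 3D positions over increasing keys."""
    if fraction <= keys[0]:
        return key_values[0]
    if fraction >= keys[-1]:
        return key_values[-1]
    hi = bisect_right(keys, fraction)  # index of first key greater than fraction
    lo = hi - 1
    t = (fraction - keys[lo]) / (keys[hi] - keys[lo])
    return tuple(a + t * (b - a) for a, b in zip(key_values[lo], key_values[hi]))


# Hypothetical key frames: move along x first, then along y.
keys = [0.0, 0.5, 1.0]
key_values = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (10.0, 5.0, 0.0)]

quarter = interpolate_position(0.25, keys, key_values)
three_quarters = interpolate_position(0.75, keys, key_values)
```

Each fraction value sampled from the TimeSensor would be passed through such a function, and the resulting position routed to the translation field of the Transform node.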

2.4 Sequential Implementation 

Algorithm 1 shows the pseudo-code algorithm of a typical VRML97 browser. If 
no discrete events are scheduled, continuous events are sampled as quickly as 
possible, adapting the sampling frequency to hardware capabilities. This event 
evaluation is alternated with frame rendering of the new geometric layout. 

Scheduled discrete events force the evaluation of all continuous events at that 
same time (see up-to-date requirement above). If any discrete events have not 
yet been evaluated, no rendering takes place. 

Algorithm 2 shows the evaluation of the event cascade for each initial (sensor)
event C_i or D_j (mapped to E). The loop breaking rule prohibits cyclic loops by
limiting each eventOut to only one event per timestamp. Otherwise, R' contains
all edges of the routing graph pointing out of E. The fan-out destinations In_i
of R'

Algorithm 1 Sequential VRML97 Pseudocode
lasttime ← 0;
loop
    now ← Browser.getWorldTime();
    if any discrete sensor eventOuts S_i scheduled with lasttime < t_Si < now, e.g.,
       asynchronous user input, or finished TimeSensor cycle then
        t_D ← time of most imminent S_i;
        D ← {D_j | t_Dj = t_D};
        C ← sample of all continuous eventOuts at time t_D;
        evaluate event cascades for each C_i ∈ C; /*algorithm 2*/
        evaluate event cascades for each D_j ∈ D; /*algorithm 2*/
        lasttime = t_D;
    else
        C ← continuous events sampled from all active and enabled TimeSensors at
            time now;
        evaluate event cascades for each C_i ∈ C; /*algorithm 2*/
        lasttime = now;
        rendering of the new geometric world layout;
    end if
end loop



Algorithm 2 Event Cascade Evaluation for a sensor Event E
if eventOut E has already ‘fired’ for time t_E then
    stop; /*loop breaking rule*/
else
    R' ← {(Out, In_i) ∈ R | Out = E};
    process all In_i, potentially generating a set of new events E'_ij for each In_i;
    evaluate event cascades for all E'_ij produced by using this algorithm recursively;
end if



are evaluated in turn. Possibly, event processing at the destination In_i may
result in the creation of further events E', and therefore recursive invocations of
algorithm 2 until the complete event cascade is evaluated.
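
The recursive evaluation with the loop breaking rule can be illustrated with a short sketch. This is not the browser implementation described in the paper; the class names Node and Router, the handler signature and the field names are assumptions made for the example only.

```python
from collections import defaultdict


class Node:
    """Hypothetical VRML-like node: turns an incoming event into new events."""

    def __init__(self, name, handler=None):
        self.name = name
        # handler(field, value) returns [(out_field, out_value), ...]
        self.handler = handler or (lambda field, value: [])
        self.received = []  # (field, value, timestamp) in arrival order

    def process(self, field, value, timestamp):
        self.received.append((field, value, timestamp))
        return self.handler(field, value)


class Router:
    """Hypothetical routing graph plus the per-timestamp 'fired' bookkeeping."""

    def __init__(self):
        # (source node name, eventOut) -> list of (destination node, eventIn)
        self.routes = defaultdict(list)
        self.fired = set()  # (node name, eventOut, timestamp)

    def add_route(self, src, out_field, dst, in_field):
        self.routes[(src.name, out_field)].append((dst, in_field))

    def evaluate(self, src, out_field, value, timestamp):
        """Evaluate the cascade started by one eventOut (cf. algorithm 2)."""
        key = (src.name, out_field, timestamp)
        if key in self.fired:
            return  # loop breaking rule: one event per eventOut per timestamp
        self.fired.add(key)
        for dst, in_field in self.routes[(src.name, out_field)]:
            for new_field, new_value in dst.process(in_field, value, timestamp):
                self.evaluate(dst, new_field, new_value, timestamp)


# Two nodes routed to each other in a cycle: the cascade still terminates.
a = Node("A", lambda field, value: [("out", value + 1)])
b = Node("B", lambda field, value: [("out", value + 1)])
router = Router()
router.add_route(a, "out", b, "in")
router.add_route(b, "out", a, "in")
router.evaluate(a, "out", 0, timestamp=1.0)
print(a.received, b.received)
```

In this sketch the second attempt of node A to fire its eventOut at the same timestamp is suppressed, so each node receives exactly one event.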

Algorithm 2 represents only one possible way of ordering event processing of
conceptually simultaneously occurring events for sequential execution. Beyond
the requirement that events be evaluated in timestamp order, VRML does not
specify any ordering of event processing; that is, the evaluation order of
branches in a fan-out configuration, as well as that of eventIn processing at
fan-in nodes, is implementation dependent.

Fig. 2. Parallel Evaluation of Single Event E. All events have the same
timestamp t_E.



3 Opportunities for Parallelism 

As worlds become more complex, the main loop of algorithm 1 takes more time, 
which can result in a reduced sampling frequency for continuous events, and 
therefore jerky scene updates. Further, the system may become over-saturated 
with discrete events if they are generated more frequently than the system is 
able to evaluate their event cascades. In this section we examine opportunities 
for tackling these problems by parallelising the VRML execution model. 

3.1 Parallelism within a Single Event Cascade 

In algorithm 2, if a single initial sensor event E has a fan-out configuration, all
eventIn fields In_i linked to it can be processed in parallel (see figure 2).
Recursion may lead to an even higher degree of parallelism. This is possible
without affecting VRML97 semantics, as no evaluation order for fan-out events
In_i is defined. As event notification is the sole communication mechanism
between nodes, there can be no undesirable interference between two execution
paths.

Due to fan-in configurations, two execution paths might reunite at one common
node. To avoid unwanted side effects in updating the node’s private fields,
it is paramount that event processing is performed sequentially at the node.
That is, some form of synchronisation is necessary for incoming events, for
example a queue which buffers pending requests for processing.
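
A minimal sketch of this combination (parallel fan-out, sequential processing at a fan-in node) is given below. It swaps in a per-node lock for the buffering queue mentioned above, purely to keep the example short; SerialNode and fan_out are hypothetical names, not part of VRML or SODA.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class SerialNode:
    """Hypothetical node whose event handler must never run concurrently."""

    def __init__(self):
        self._lock = threading.Lock()
        self.total = 0        # private field updated by incoming events
        self.events_seen = 0

    def on_event(self, value):
        # The lock serialises fan-in: only one event is processed at a time.
        with self._lock:
            current = self.total
            self.total = current + value
            self.events_seen += 1


def fan_out(destinations, value, pool):
    """Deliver one event to all destinations in parallel and wait for them."""
    futures = [pool.submit(node.on_event, value) for node in destinations]
    for future in futures:
        future.result()


# 100 parallel deliveries that all reunite at the same node.
shared = SerialNode()
with ThreadPoolExecutor(max_workers=4) as pool:
    fan_out([shared] * 100, 1, pool)
print(shared.total, shared.events_seen)
```

Without the lock, the read-modify-write on the private field could interleave between threads; with it, every delivery is applied exactly once.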

Widely branching event cascades produced by single sensor events may ex- 
hibit high degrees of parallelism. The grain size is only determined by the com- 
plexity of event processing in the participating nodes. 

3.2 Parallelism between Event Cascades 

If several initial sensor events are scheduled with the same timestamp, VRML
treats them as if they are members of the same event cascade. Fan-ins of events
with the same timestamp are allowed, and their ordering is the implementation’s
responsibility. Multiple writes to a single eventOut field are inhibited to satisfy
VRML’s loop breaking rule.

Fig. 3. Event Cascade with several Initial Events E_i, all of which have the same
timestamp t_E.

All events D_j and C_i scheduled in the main loop of algorithm 1 can therefore
be evaluated in parallel^ (see figure 3), with the same restrictions for fan-in as
discussed in 3.1.



3.3 Routing Graph Partition 

Parallelising event cascades with different times is more intricate. The VRML
specification requires that events be evaluated in timestamp order. Parallel
processing of event cascades for different timestamps could result in a node
processing events out-of-order and thus violating the VRML specification.
However, if we can identify disjoint partitions of the routing graph, then
parallelism can be exploited. The routing graph is defined as a structure
connecting eventIn fields to eventOuts. We define a partition as all routes
that are reachable from any node, following all eventOuts at the destination
node.

For disjoint partitions, event cascades with different times can run in parallel,
as no interference can take place. Within a partition, such cascades have to
be serialised in order of timestamps. This ignores the issue of a dynamically
changing routing graph®, which would require the dynamic examination of the
routing graph.
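
One way to identify such disjoint partitions for a static routing graph is sketched below. It assumes that two routes belong to the same partition whenever they are connected through shared nodes, ignoring route direction (weakly connected components), which is a conservative reading of the definition above. The function name and the node names in the example are illustrative only.

```python
def partition_routes(routes):
    """Group routes into disjoint partitions (weakly connected components).

    `routes` is a list of (source_node, destination_node) pairs.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Union the two end points of every route.
    for src, dst in routes:
        parent[find(src)] = find(dst)

    # Collect the routes of each component under its representative node.
    partitions = {}
    for route in routes:
        partitions.setdefault(find(route[0]), []).append(route)
    return list(partitions.values())


example_routes = [
    ("TimeSensor", "Interpolator"),
    ("Interpolator", "Transform"),
    ("TouchSensor", "Script"),
    ("Script", "Material"),
]
parts = partition_routes(example_routes)
print(len(parts))
```

Cascades that stay inside different partitions returned by this sketch touch no common node, so they could be evaluated concurrently even for different timestamps.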

This approach might minimally influence the perception of the world: users
may notice the effects of out-of-order changes to visible nodes. However, we
can assume that such differences in timestamps would only be in the range
of a few milliseconds, and this is therefore unimportant for almost all worlds.
Causally related behaviour will always be presented in the correct order, as
this is sequentialised through dependencies in the routing graph.

^ i.e., by spawning several instances of algorithm 2 for each event
® Script nodes in VRML may be programmed to change the topology of the routing
graph dynamically

3.4 Further Parallelism 

Beyond the above, further opportunities for parallelism are as below: 

Evaluation of Sensor Nodes can be done in parallel if their required sensor 
information is available (e.g. current time, user location, etc.). Sensor nodes 
may then register discrete events with a Scheduler. 

Scheduler The whole of algorithm 1 may be replicated for each partition of the 
routing graph. Again, synchronised time must be available at each location. 



4 Implementation 

The following gives a brief overview of the System Of Dynamic Active Objects 
(SODA), which is used as programming model and runtime system for imple- 
mentation of the VRML server. 

4.1 Active Objects Programming Model 

SODA adopts a programming model of coexisting active and passive objects, 
similar to ProActive [7]. Active objects encapsulate a concurrent activity and 
have a queue to buffer method invocations for serial processing by this activity. 
Neither explicit thread programming nor intra-object synchronisation code is 
required. Active objects are globally addressable and passed by reference. In 
contrast, passive objects can only exist privately to a single active object and 
consequently have pass-by-value semantics. 

Programs consist of a collection of active and private objects, which may 
proceed in parallel if distributed over several processors. Unlike ProActive, active 
objects are fully location transparent and do not require explicit mapping to a 
parallel platform. 

Method calls are by definition non-blocking. The caller can proceed without
waiting for the callee to return. Upon termination of the method, the callee may
hand back results through a future mechanism or by using continuation objects [2].
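
The programming model can be approximated in a few lines. The sketch below is not SODA code (SODA is a Java system and its API is not shown in this paper); ActiveObject, call, stop and Counter are hypothetical names, and Python's concurrent.futures.Future stands in for SODA's future mechanism.

```python
import queue
import threading
from concurrent.futures import Future


class ActiveObject:
    """Minimal active object: one thread serially executes queued calls."""

    def __init__(self):
        self._calls = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            item = self._calls.get()
            if item is None:  # shutdown marker
                break
            func, args, future = item
            try:
                future.set_result(func(*args))
            except Exception as exc:
                future.set_exception(exc)

    def call(self, func, *args):
        """Non-blocking invocation: enqueue the call and return a future."""
        future = Future()
        self._calls.put((func, args, future))
        return future

    def stop(self):
        self._calls.put(None)
        self._thread.join()


class Counter(ActiveObject):
    """Example active object with one private field."""

    def __init__(self):
        self.value = 0
        super().__init__()

    def _add(self, amount):
        self.value += amount
        return self.value

    def add(self, amount):
        return self.call(self._add, amount)


counter = Counter()
futures = [counter.add(1) for _ in range(5)]  # caller does not block here
results = [f.result() for f in futures]       # results are collected later
counter.stop()
print(results)
```

Because a single thread drains the queue in FIFO order, no synchronisation code is needed inside Counter itself, which mirrors the property claimed for active objects above.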

4.2 SODA Runtime 

The SODA runtime system is characterised by several key features: 

Dynamic Load Balancing. Transparently to the programmer, the runtime 
system is responsible for migrating active objects during runtime with the 
aim of maximising processor utilisation for the overall system. This is impor- 
tant where active objects have relatively high fluctuations in their resource 
requirements.

Active Object Multiplexing. Active objects may be multiplexed onto
threads to reduce the number of created threads and prevent thread flooding.

True One-Way Method Calls. Many related systems (e.g. JavaParty [8] and
ProActive) use RMI as transport protocol for method invocations on active
objects. This has two disadvantages for implementing an active object model.
Firstly, RMI does not provide asynchronous method calls. Secondly, a network
socket is opened for every client-callee pair. This is not optimal for a
large number of active objects.

ARMI [9] overcomes the first limitation by creating an additional thread 
at the caller to wait for method call termination. In contrast, SODA uses 
a socket-layer communication protocol to implement one-way calls without 
the need to spawn additional threads on the caller-side. In addition, SODA 
performs dynamic connection management: at most one TCP/IP socket con- 
nection is established between every pair of hosts at any time. SODA also 
exploits “unexpected locality”® of active objects [8]. 

Efficient Object Serialisation. Standard Java object serialisation is unsuit- 
able for high-performance computation [8]. SODA therefore uses explicit se- 
rialisation/deserialisation routines for every object. These routines will later 
be generated automatically through a special classloader performing load- 
time transformation of Java byte-code. 

4.3 Mapping onto Active Objects 

VRML nodes and SODA active objects share many commonalities. Following 
the object-orientation paradigm, communication among VRML nodes can only 
take place through a well-defined interface. In both systems, incoming messages 
trigger the asynchronous execution of member functions. 

We applied a mapping between components of the two systems (see figure 4)
as follows: VRML nodes are directly represented by active objects. Those nodes
may then perform parallel event generation or processing, which is the mainstay
for parallel event cascade evaluation. Asynchronous VRML event passing is
mapped onto the asynchronous method call semantics of SODA. Figure 4 shows
how all elements of the VRML execution model find a valid equivalent in SODA.

The scheduler described by algorithm 1 is implemented as an additional 
active object. SODA’s future mechanism is used to inform the scheduler object 
about termination of an event cascade. 

4.4 Client-Server Communication 

Client-server communication is based on a light-weight datagram protocol for
performance reasons. Clients may communicate with any server node to send
the user’s position and to receive information about objects within their
respective view volume (figure 5). Client-server communication can be reduced by
performing server-side culling [10].

® I.e., no expensive loopback communication takes place if caller and callee reside on
the same JVM.

Fig. 5. Parallel VRML Client-Server Model



4.5 Preliminary Performance Results 

A cluster of eight Pentium II 233 workstations was used for initial performance
measurements. The machines were connected via a Fast Ethernet LAN and ran
Linux with Sun’s JVM 1.2.2.

The example VRML world consisted of a single TimeSensor, routing events to
a set of PositionInterpolators. These interpolator nodes in turn were responsible
for updating the coordinates of a set of geometry nodes. During the experiments,
we varied the number of geometry nodes and also the number of workstations
participating in the server computation. Each time we measured the average
update frequencies (framerates) the server could achieve.

The results in figure 6 show that real speedups can be obtained. The same 
program run on eight workstations can produce higher framerates throughout, 
compared to one-, two- and four-workstation setups. Speedups relative to the 
single workstation run increase with the number of interpolator-driven geometry 
nodes in the world; this can be explained by a growing granularity of parallel 
activities. 
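
The normalisation behind the speedup figures can be written down explicitly. The following is only a notational sketch: the symbols f_p(n) and S_p(n) are introduced here for illustration and do not appear in the original text, and no measured values are implied.

```latex
% f_p(n): average framerate achieved with p workstations for a world
%         containing n interpolator-driven geometry nodes (assumed notation)
S_p(n) = \frac{f_p(n)}{f_1(n)}, \qquad p \in \{2, 4, 8\}
```

A value S_p(n) > 1 means that the p-workstation run achieves a higher framerate than the single-workstation run for the same world.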











A Parallel VRML97 Server Based on Active Objects 



Fig. 6. Performance Results of the Prototype Implementation: (a) average
framerates in relation to the number of VRML nodes in the world; (b) speedup
of the parallel versions, normalised to the framerate of the single-workstation
version. The plots show curves for 1, 2, 4 and 8 workstations; the horizontal
axis of both panels is the number of VRML nodes.



5 Conclusion and Further Work 

This paper has presented a novel, scalable, client-server based approach to im- 
plementing complex virtual worlds with many interacting users. The clients that 
browse the world are protected from the costs required to support a large, com- 
plex world by the server, which carries the burden of progressing the state of the 
world, and determining the fraction of the world that is visible to each client. The 
work of the client is restricted to rendering the visible world fraction whenever 
it receives updates from the server. The server is able to support large, complex 
worlds due to its scalable, parallel design. The paper has shown how a paral- 
lel implementation of VRML can be built without changing the semantics of 
the execution mechanism. The results from the prototype demonstrate that real 
performance gains can be achieved. Experience in building the prototype using 
SODA has shown the power of the active object model for parallel, object-based 
software design. 

Current work includes tuning of the system, and analysing its ability to scale 
up to huge, complex worlds, with many users, running on a larger number of 
parallel nodes.

Abstract. Methodologies derived from Genetic Programming (GP) and
Knowledge Discovery in Databases (KDD) were used in the parallel
implementation of the indexer simulator to emulate the current World Wide
Web (WWW) search engine indexers. This indexer followed the indexing
strategies employed by AltaVista and Inktomi, which index each word in
each Web document. The insights gained from the initial implementation of
this simulator have resulted in the initial phase of the adaptation of a
biological model. The biological model will offer a basis for future
developments associated with an integrated Pseudo-Search Engine. The basic
characteristics exhibited by the model will be translated
so as to develop a model of an integrated search engine using GP. The 
evolutionary processes exhibited by this biological model will not only 
provide mechanisms for the storage, processing, and retrieval of valuable 
information but also for Web crawlers, as well as for an advanced com- 
munication system. The current Pseudo-Search Engine Indexer, capable 
of organizing limited subsets of Web documents, provides a foundation 
for the first simulator of this model. Adaptation of the model for the 
refinement of the Pseudo-Search Engine establishes order in the inherent 
interactions between the indexer, crawler and browser mechanisms by in- 
cluding the social (hierarchical) structure and simulated behavior of this 
complex system. The simulation of behavior will engender mechanisms 
that are controlled and coordinated in their various levels of complexity. 
This unique model will also provide a foundation for an evolutionary ex- 
pansion of the search engine as WWW documents continue to grow. The 
simulator results were generated using Message Passing Interface (MPI) 
on a network of SUN workstations and an IBM SP2 computer system. 



1 Introduction 

The addition of new and improved genetic programming methodologies [14], [36]
will enable the preliminary Pseudo-Indexer model [33] to generate a population
of solutions [6], [22] that provide some order to the diverse set of Web pages
comprising the current and future training sets. The applicability of genetic
programming to this task results from the existence of an adequate population
size in relation to the difficulty in organizing the diverse set of Web pages
[30], [34].

Studies of parallel implementations of the genetic programming methodology
[6], [15], [18], [25] indicated that population evaluation is the most
time-consuming associated process. Population evaluations for the Pseudo-Search
Engine’s Indexer will result from calculating the fitness measures associated
with each Web page after one of the following: 1) parsing the training set,
2) additions to the training set, or 3) the execution of one or more of the GP
operators. The
cost associated with the fitness computations [18] offsets the cost associated with 
the load balancing and communication overheads. Previous GP studies have re- 
sulted in dynamic load-balancing schemes which can be used to monitor the 
irregularity in processor work loads, a result of parsing variable size Web pages 
(see Figure 2). A major shortcoming of GP applications [6] is the amount of 
execution time required to achieve a suitable solution. 



Table 1. Genetic Programming and its corresponding Evolutionary Operators.

Cause                  Biological Model      Genetic Programming Operators
Individuals fly out    Migration/swarming    Migration
Males/females mate     Reproduction          Crossover/reproduction
Evolutionary changes   Mutation              Mutation/editing



2 Chromosome Modeling Using Genetic Methodologies 

2.1 Genetic Methodologies 

Genetic programming is an evolutionary methodology [36] that extends the tech- 
niques associated with Genetic Algorithms (GAs) [4]. The evolutionary force of 
these methodologies reflects the fitness of the population. The basis of GAs re- 
sults from designing an artificial chromosome of a fixed size that maps the points 
in the problem search space to instances of the artificial chromosome. The ar- 
tificial chromosome is derived by assigning variables of the problem to specific 
locations (genes). The memes [1] denote the value of a particular gene variable. 
Genetic algorithms provide an efficient mechanism for multidimensional search 
spaces that may be highly complex and nonlinear. The components of a GP are: 

1. Terminal set. The terminal set consists of input variables or constants. 

2. Function set. The function set varies, based on the GP application, by
providing domain-specific functions that construct the potential solutions.

3. Fitness measure(s). The fitness measure(s) provide numeric values for the
individual components associated with the members of a population.

4. Algorithm control parameters. The algorithm control parameters are depen- 
dent on population size and reproduction rates (crossover rate and mutation 
rate). 

5. Terminal criterion. The terminal criterion uses the fitness measures to de- 
termine the appropriateness of each solution based on an error tolerance or 
a limit on the number of allowable generations. 

The search space is defined by the terminal set, function set, and fitness mea- 
sures. The quality and speed of a GP application is controlled by the algorithm 
control parameters and terminal criterion. 

Genetic programming is composed of fitness evaluations [15], [18] that result 
from the application of genetic operators to individual members of a chosen 
population (Web pages). The operators incorporated in this methodology are 
individual (or subgroup) migration, reproduction, crossover, and mutation. The 
use of a linear string of information can result from the direct modeling of DNA. 
The outcome from applying the genetic operations to this string of information 
corresponds to obtaining a globally optimum (or near-optimum) point in the 
original search space of the problem. 

The migration operator consists of a process to select individual(s) to delete 
from the population. The reproduction operator consists of a process to copy an 
individual into a new population. The crossover operator generates new offspring 
by swapping subtrees of the two parents. The mutation operator randomly se- 
lects a subtree of a chosen individual. This process then randomly selects a new 
subtree to replace the selected subtree. The application of the mutation operator 
reduces the possibility of achieving a solution that represents a local optimum. 
The selection of individuals from the distinct subpopulations may follow several 
formats. A new methodology will be developed when applying these operators 
to subpopulations of Web pages. 
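
The subtree-based crossover and mutation operators described above can be sketched as follows. This is a generic GP illustration, not the Pseudo-Search Engine implementation: trees are assumed to be nested lists of the form [function, child, child, ...] with strings as terminals, and the mutation simply inserts a random terminal as the new subtree. All function names are hypothetical.

```python
import copy
import random


def subtree_paths(tree, path=()):
    """Return the index path of every subtree, the root included."""
    paths = [path]
    if isinstance(tree, list):
        for i, child in enumerate(tree[1:], start=1):
            paths.extend(subtree_paths(child, path + (i,)))
    return paths


def get_subtree(tree, path):
    for i in path:
        tree = tree[i]
    return tree


def replace_subtree(tree, path, new):
    """Return a copy of `tree` with the subtree at `path` replaced by `new`."""
    if not path:
        return copy.deepcopy(new)
    result = copy.deepcopy(tree)
    parent = get_subtree(result, path[:-1])
    parent[path[-1]] = copy.deepcopy(new)
    return result


def crossover(a, b, rng):
    """Swap one randomly chosen subtree between the parents a and b."""
    pa = rng.choice(subtree_paths(a))
    pb = rng.choice(subtree_paths(b))
    child_a = replace_subtree(a, pa, get_subtree(b, pb))
    child_b = replace_subtree(b, pb, get_subtree(a, pa))
    return child_a, child_b


def mutate(tree, rng, terminals):
    """Replace one randomly chosen subtree by a randomly chosen terminal."""
    path = rng.choice(subtree_paths(tree))
    return replace_subtree(tree, path, rng.choice(terminals))


def leaves(tree):
    """Collect the terminals of a tree from left to right."""
    if not isinstance(tree, list):
        return [tree]
    return [leaf for child in tree[1:] for leaf in leaves(child)]


rng = random.Random(0)
parent_a = ["+", "x", ["*", "y", "2"]]
parent_b = ["-", ["+", "a", "b"], "c"]
child_a, child_b = crossover(parent_a, parent_b, rng)
mutant = mutate(parent_a, rng, ["x", "y", "1"])
print(child_a, child_b, mutant)
```

Since crossover only exchanges material, the terminals of the two children together are exactly those of the two parents; mutation is the only operator in this sketch that can introduce new material.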

2.2 Modeling Chromosomes 

The Pseudo-Search Engine Indexer’s hybrid chromosome structure in Figure 1
follows the methodologies of GP and GAs. These structures represent subsets of
Web pages (subpopulations) that reside at each node (Web site) in a distributed
computer system. Each strand of genes that resides on each Node_i (Web site) is
viewed as a set of the genetic components of an individual member of a simulated
species. Each horizontal strand in the chromosome structure represents a Web
page that would translate into a meme. The bracket to the left of the Web pages
implies that the pages have similar characteristics to those that comprise a gene
(allele) and its memes.

The components of the genes are referred to as the memes, and the number
of memes varies within each allele. The memes are the actual Web pages
corresponding to primitive features that are contained at each Web site. New
alleles are formed by the addition of new Web pages at a given Web site. When
new memes are added to enhance an allele’s current set of memes, each allele
can grow in size, but its chromosome length remains fixed. The application of
the GP crossover operator results in two new chromosomes, formulated from the
transmission of components of the genetic makeup of the parents. The bracket
mechanism provides a numerical order to the Web pages in this structure. This
approach provides a mechanism to facilitate the evolution of diverse nodes. The
use of a single population leads to panmictic selection [15], in which the
individuals selected to participate in a genetic operation can be from anywhere
in the population. The method used to avoid local optima [22] involves
subpopulations [24]. This model will be expanded by simulating double-stranded
RNA genomes [10] as the population of indexed Web pages grows.

Fig. 1. Distribution of Web pages.



3 The Biological Model 

The biological model for the Pseudo-Search Engine [32] is based on the social
structure of honeybees [9], [29]. A mathematical model of the social structure of
honeybees will be used to enhance the incorporation of GP methodologies [14],
since the bee colony represents a highly evolved biological system [17] which
forms a basis to model the continuous expansion of Web pages.

The genetic programming approach incorporates the following genetic
operators: migration, reproduction, crossover, and mutation. A similar group of
evolutionary operators for honeybee colonies is: migration, swarming, and
supersedure (replacement of an existing, older queen by a younger queen
following a fight). The evolutionary operators associated with the queen,
drones, and worker bees are similar to the genetic programming operators. The
crossover operator is similar to the mating process for the queen bee, but
differs in that the parent chromosomes cease to exist in GP, whereas only the
queen persists in the evolutionary sense. The children chromosomes replace
their corresponding parent chromosomes in standard GP. The migration operator
in GP purges an existing subpopulation of the least desirable members (traits)
and in some cases the best member (trait) [25]. This process was implemented
in GP as an attempt to avoid local optima.



Table 2. Factors that determine the effect of the Evolutionary Operators in a
Honeybee Colony. The columns list the causes of the evolutionary effect; X marks
an applicable cause, - an inapplicable one.

Operator     Small nest  Crowding  Queen    Unsatisfactory  Bees     Failure or  Colony
             cavity                rearing  environmental   fly out  death of    activities
                                            conditions               queen       hindered
Migration    X           X         -        X               X        -           X
Swarming     X           X         X        -               X        -           X
Supersedure  -           X         X        -               -        X           X



In a true evolutionary model, individuals migrate from one subpopulation
to another [7] for many diverse reasons, such as crowding, changes in the
environmental conditions, limitations on colony activities, or members becoming
disoriented (see Tables 1 and 2). These reasons lead to the incorporation of
fault tolerance in the architecture of the integrated search engine. These
external evolutionary factors benefit the gene pool by ensuring diversity. The
mating ritual [9] for queen bees in colonies provides a built-in mechanism for
incorporating a host of diverse genetic profiles into existing and/or new
colonies. The drone bee mates once and dies, a process similar to a worker bee
using its stinger and dying.



4 Mechanisms for Expanding the Current Load Balancing 
Model 

4.1 Overview 

The social structure associated with the honeybee hierarchy provides an ordered
structure to what can be referred to as the simplest solution to the problem
of multiway rendezvous. Figure 2 shows the initial implemented load-balancing
model, using an MPI algorithm [19] as the basis. This model followed the approach
of a node manager for distributed computing. Similar approaches have been
implemented for general distributed computing, as well as for the implementation
of parallel GP load-balancing models. The implementation of the load-balancing
mechanism for the Pseudo-Search Engine Indexer model follows the theory
associated with the implementation of an Event Manager (EM) [2].

Fig. 2. Load balancing model.
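
To make the node-manager idea concrete, the sketch below assigns variable-size Web pages to workers so that the loads stay even. It is a deliberately simplified stand-in: the model described above is MPI-based and its details are not given here, whereas this example uses a plain greedy rule (largest page first, always to the least-loaded worker) with made-up page sizes. The function name balance is hypothetical.

```python
import heapq


def balance(page_sizes, n_workers):
    """Assign each page to the currently least-loaded worker.

    Returns (assignment, loads): assignment[w] lists the page indices given
    to worker w, and loads[w] is the total size that worker received.
    """
    heap = [(0, w) for w in range(n_workers)]  # (current load, worker id)
    heapq.heapify(heap)
    assignment = [[] for _ in range(n_workers)]
    # Hand out the largest pages first so that big items do not arrive last.
    order = sorted(range(len(page_sizes)), key=lambda i: -page_sizes[i])
    for i in order:
        load, w = heapq.heappop(heap)
        assignment[w].append(i)
        heapq.heappush(heap, (load + page_sizes[i], w))
    loads = [0] * n_workers
    for load, w in heap:
        loads[w] = load
    return assignment, loads


sizes = [8, 1, 7, 3, 2, 5]  # illustrative page sizes, not measured data
assignment, loads = balance(sizes, n_workers=2)
print(assignment, loads)
```

A dynamic scheme would instead hand out pages on request as workers become idle; the greedy rule gives the same distribution when the parsing time is proportional to the page size.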

The EM concept provides a paradigm for the development and implementa- 
tion of interface mechanisms associated with the three major components of the 
Pseudo-Search Engine. The manager interface paradigm will be an extension of 
the multiway rendezvous model [2]. This model provides the following benefits: 1) 
an extension of the binary rendezvous model where communication involves the 
synchronization of exactly two nodes, and 2) mechanisms for the synchronous 
communication between an arbitrary number of asynchronous nodes. These in- 
terface components of the Pseudo-Search Engine managers are: 

1. Web scout/forager manager, M1

2. Web page indexer manager, M2 

3. Web browser interface manager, M3 

Each of these three managers will control its respective load-balancing mecha- 
nisms based on its respective functionality (see Figure 3). 
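
The synchronisation aspect of a multiway rendezvous, in which an arbitrary number of otherwise asynchronous nodes meet at one point, can be illustrated with a barrier. The sketch below uses threads and Python's threading.Barrier as a stand-in for the managers; it covers only the synchronisation and not the data exchange, and run_rendezvous is a hypothetical name.

```python
import threading


def run_rendezvous(n_nodes):
    """Let n_nodes asynchronous threads meet at one synchronisation point."""
    barrier = threading.Barrier(n_nodes)
    lock = threading.Lock()
    arrived = []        # node ids in order of arrival
    seen_at_exit = []   # number of arrived nodes observed after the barrier

    def node(node_id):
        with lock:
            arrived.append(node_id)
        barrier.wait()  # blocks until all n_nodes threads have arrived
        with lock:
            seen_at_exit.append(len(arrived))

    threads = [threading.Thread(target=node, args=(i,)) for i in range(n_nodes)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return arrived, seen_at_exit


arrived, seen_at_exit = run_rendezvous(3)
print(sorted(arrived), seen_at_exit)
```

With n_nodes = 2 this degenerates to the binary rendezvous; larger values show the multiway case, where no participant proceeds until every participant has arrived.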



4.2 Foraging Web Scouts/Crawlers for the Pseudo-Search Engine:
An Active Networks Approach Using Genetic Programming

Overview of Web Scouts/Crawlers. The efficiency of Internet applications
is being tested by the addition of new applications that compete for the
same network resources. Studies associated with network traffic [5], [16] show
the need for adaptive congestion control and avoidance at the application level.
The side-effects of the current non-adaptive application mechanisms result in
self-similarity among network transmissions. The need for efficient Web scouts
(probes) for the Pseudo-Search Engine Web crawlers is due to future
requirements associated with new applications. The exponential growth of Web
documents, the incorporation of multimedia applications with real-time demands,
and a steady increase in WWW users will lead to refinements in efficient design
and implementation of crawler mechanisms. The competition for bandwidth will
reward the adaptive and efficient applications. The incorporation of active
networks (ANs) methodologies [27], [28] can enhance the development and
incorporation of the biological model associated with the Pseudo-Search Engine.

Fig. 3. The load balancing architecture of the integrated search engine.

Aspects of the foraging mechanisms used by the bee colony provide a basis
for scout/crawler mechanisms to be used for congestion control and data
transmission [13]. The factors that influence the amount of foraging are
temperature, weather, and day length. The weather affects the availability of
pollen and nectar. The temperature coupled with the time of day determines the
quantity of pollen and/or nectar. The attractiveness of particular crops is
rated based on several criteria. Similar mechanisms are needed to determine the
routing tables for retrieving Web pages from distributed computer networks that
span the Internet and provide a diversity of resources.



Overview of Active Networks. Active networks research provides insight
into the software needed to support GP communicating agents being developed
to retrieve WWW documents from the diverse set of Web sites. ANs enable
the retrieval of state information from routers that support the infrastructure of
the Internet by embedding active capsules (components) [11] within each packet
transmitted via the Internet. The active capsules are executed on the routers
as the packets traverse the Internet starting at a source (host) and possibly
terminating at the destination (host).

The active networks approach was developed to facilitate efficient network
communication [35] by incorporating active capsules in the packets that are
executed and routed by the switches that support the Internet’s infrastructure.
The ANs model was designed to minimize the additional computational overhead
needed at the router level to activate the capsule, but the overhead increases
with the complexity of the transmitted active capsule component. Additional
complexity can be added to the basic AN model through the enhancement of the
execution environments (EEs) [3], which results in virtual EEs. This area of
research provides execution environments which can be used to program routers
to capture state information associated with LANs and/or WANs, which in turn
can be incorporated into a methodology for creating the scouts/crawlers needed
for the retrieval of Web pages for the Pseudo-Search Engine [30].



4.3 Proposed Web Scout and Crawler Mechanisms 

The Web crawlers required to adequately retrieve the growing number of Web
pages will require some form of adaptive methodology as each Web scout (probe)
searches for efficient paths (routes) to an adequate source of information (Web
documents) to build Internet Service Provider (ISP) router tables for the crawler
mechanisms. The initial step of this proposed methodology is to send out scouts
to all ISPs in a manner similar to reliable flooding [20]. Timing and path
information to and from the ISPs is collected in order to find efficient routes
to the portal associated with the hosts of information reflected in its
hierarchy of sub-hosts. Each provider is viewed as a gateway into the
information associated with its sub-host Web page directory structure. This
methodology has the ability to discover new ISPs, as well as new sub-hosts
providing services to new and existing Web clients. The end effect is the
faster discovery of new Web pages.
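
The scouting step can be illustrated with a small sketch. It is not the authors' implementation: the topology, the node names and the latencies are hypothetical, and flooding is modeled simply as "every node forwards a scout once, on first arrival", which yields the earliest arrival time and the path taken to each ISP.

```python
import heapq

def flood_scouts(links, source):
    """Return {node: (arrival_time, path)} for the first scout to reach each node.

    links: dict mapping node -> list of (neighbour, latency) pairs.
    Assumption: a node forwards a scout only on its first arrival.
    """
    best = {}
    queue = [(0.0, source, [source])]
    while queue:
        time, node, path = heapq.heappop(queue)
        if node in best:
            continue  # a faster scout already reached this node
        best[node] = (time, path)
        for neighbour, latency in links.get(node, []):
            if neighbour not in best:
                heapq.heappush(queue, (time + latency, neighbour, path + [neighbour]))
    return best

# Hypothetical topology; latencies in milliseconds.
links = {
    "crawler": [("isp_a", 5.0), ("isp_b", 20.0)],
    "isp_a": [("isp_b", 4.0), ("isp_c", 30.0)],
    "isp_b": [("isp_c", 3.0)],
}
routes = flood_scouts(links, "crawler")
```

The resulting table of times and paths plays the role of the router table that the crawler mechanisms would consult.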

4.4 Strategies for Communicating Agents Using Genetic 
Programming 

Iba et al. [12] presented studies of communicating agents that reflect the need to 
evaluate techniques for developing cooperating strategies. One application of this 
methodology is the Predator-Prey pursuit problem - a test bed in Distributed 
Artificial Intelligence (DAI) research - that measured the impact of limited abil- 
ity and partial information for agents pursuing/seeking the same goal indepen- 
dently, instead of relying on cooperation to solve a discrete set of subproblems. 
The metrics associated with this aspect of GP research included: 1) applicability 
of GP to multi-agent test beds, 2) observing the robustness (brittleness) of co- 
operative behavior, and 3) examining the effectiveness of communication among 
multiple agents. This co-evolutionary strategy provides a methodology for the 
comprehensive assessment of the impact of robustness (brittleness) of coopera- 
tive behavior and its effectiveness among communicating agents. The robustness 
of a GP program was defined [12] as the ability of agents to cope with noisy 
or unknown situations (unknown test data) within a GP application when com- 
munication among multiple agents was due to effective work partitioning. New 
and potentially improved behavior patterns were found to evolve through the 
use of a fitness measure associated with a co-evolutionary strategy. The panoply
(multiplicity) of relationships among the communicating agents includes:

— Agents requesting data from other agents (Communicating Agents) 

— Agents negotiating their movements with other agents (Negotiating Agents) 

— Agents controlling other agents (Controlling Agents) 



5 Limitations of the Parallel Implementations 

5.1 Timing Results 

Message-passing studies [8], [26] have been conducted to determine the efficiency
of parallel programs implemented on shared as well as distributed memory
computer systems. The implemented message-passing paradigm was based on a
client-server model [19] with n - 1 clients for a sub-cluster of n nodes. The
message size used in all data transmissions was consistent, and the data type
contained an array of 101 characters. The sending and receiving message
patterns for the n-node cluster varied according to OS tasks and the tasks of
other users on the nodes in the clusters. This study was not conducted in
dedicated cluster environments.

Fig. 4. Execution times for the workstations.

Walker [31] described the load-balancing model that led to the execution 
times in Figures 4 and 5. The execution results displayed reflect the diversity 
existing among the different types of computer hardware used in this study. 
These results also reflect quasi-dedicated computer environments associated with 
tightly coupled and loosely coupled parallel computer models. The workstation 
timing results show consistent increases in required CPU time as the training set 
size increases. The IBM SP2 [23] results display spikes that reflect the impact 
of other users in a tightly coupled environment, as well as the nondeterministic 
execution of the load-balancing model [21]. 

6 Conclusion 

The current Pseudo-Search Engine Indexer, capable of organizing limited sub- 
sets of Web documents, provides a foundation for the first beehive simulators. 
Adaptation of the honeybee model for the refinement of the Pseudo-Search En- 
gine establishes order in the inherent interactions between the indexer, crawler 
and browser mechanisms by including the social (hierarchical) structure and 
simulated behavior of the honeybee model. The simulation of behavior will en- 
gender mechanisms that are controlled and coordinated in their various levels of 
complexity. 

Communication hardware and page size irregularity affect the load distribution
of the Web pages. These effects show in the uneven distribution of pages, as
well as the erratic behavior in the execution times in Figures 4 and 5. The
nondeterministic execution of the receives by the program manager may be
reduced by incorporating the order of receive/send operators for the clients
(Web sites) requesting Web pages from the Indexer program manager.

Fig. 5. Execution times for the IBM SP2 (axis: Number of Processors (Nodes)).

The timing results generated for the network of workstations were as expected.
The predicted output for this environment was also displayed in the speedup and
efficiency results. The self-scheduling, load-balancing model indicated that
increases in the workload will improve efficiency. The use of 512 Web pages
showed that the model will provide an ideal starting point for increasing the
workload (addition of the genetic operators) without increasing the number of
Web pages. The speedup and efficiency results associated with the IBM SP2 point
out the need for a specific load-balancing model.



Abstract. Networks of workstations (NOWs) have become important
and cost-effective parallel platforms for scientific computations. In prac- 
tice, a NOW system is heterogeneous and non-dedicated. These two 
unique factors make scheduling policies on multiprocessor/multicomputer 
systems unsuitable for NOWs, but the coscheduling principle is still an 
important basis for parallel process scheduling in these environments. 
The main idea of this technique is to schedule the set of tasks composing 
a parallel application at the same time, to increase their communication 
performance. In this article we present an explicit coscheduling algorithm 
implemented in a Linux NOW, of PVM distributed tasks, based on Real 
Time priority assignment. The main goal of the algorithm is to execute 
efficiently distributed applications without excessively damaging the re- 
sponse time of local tasks. Extensive performance analysis as well as 
studies of the parameters and overheads involved in the implementation 
demonstrated the applicability of the proposed algorithm. 



1 Introduction 

Parallel and distributed computing in networks of workstations (NOWs) is
receiving ever-increasing attention. A recent research goal is to build a NOW
that runs parallel programs with performance equivalent to that of an MPP
(Massively Parallel Processor) and also executes sequential programs as well as
a dedicated uniprocessor. To reach it, two issues must be addressed: how to
coordinate the simultaneous execution of the processes of a parallel job, and
how to manage the interaction between parallel and local user jobs.

The studies in [1] indicate that the workstations in a NOW are normally
underloaded. Basically, there are two methods of making use of these idle CPU
cycles: task migration [2,3] and job scheduling [4,5,6]. According to the
research carried out by Arpaci [7], task migration overheads and the
unpredictable behavior of local users may lower the effectiveness of task
migration in a NOW. Our research focused on the approach of keeping both local
and parallel jobs together and scheduling them efficiently.

* This work was supported by the CICYT under contract TIC98-0433

A NOW system is heterogeneous and non-dedicated. The heterogeneity can be
modeled by the power weight [8]. As for the non-dedicated feature, a mechanism
must be provided to ensure that no extra context switch overheads due to
synchronization delays are introduced. Ousterhout’s solution for time-shared
multiprocessor systems was coscheduling [9]. Under this traditional form of
coscheduling, the processes constituting a parallel job are scheduled
simultaneously across as many nodes of a multiprocessor as they require.

Explicit coscheduling [9,5] ensures that the scheduling of communicating jobs is
coordinated by constructing a static global list of the order in which jobs
should be scheduled; a simultaneous global context switch is then required in
all the processors. Zhang [4], based on the coscheduling principle, has
implemented the so-called “self-coordinated local scheduler”, which guarantees
the performance of both local and parallel jobs in a NOW by means of a
time-sharing and priority-based operating system. He varies the priority of the
processes according to the power usage agreement between local and parallel
jobs.

In contrast with Zhang’s study, a real implementation of explicit coscheduling 
in a NOW is presented in this article; so that the user of the parallel machine 
has all the computing power of the NOW available during a short period of time 
with the main aim of obtaining good performance of distributed tasks without 
excessively damaging the local ones. 

In section 2, the environment DTS (Distributed Scheduler) where the cosche- 
duling implementation is built will be introduced. In section 3, our explicit 
coscheduling algorithm of PVM distributed tasks in a Linux NOW is presented. 
Also, a synchronization algorithm that improves the performance of the message 
passing in distributed tasks is proposed. In section 4, the good behavior of the 
implemented algorithms is checked by means of measuring the execution time 
on both synthetic applications and NAS benchmarks; in addition, the response 
time of local jobs and other parameters and overheads of special interest are 
measured. Finally, the last section includes our conclusions and a description of 
the future work. 



2 DTS Environment 

We are interested in assigning one period of time to distributed tasks and
another to interactive ones, and in varying these periods dynamically according
to the local load average of a NOW. Our aim is also to avoid modifying the
kernel source, because the system needs to be portable. Our solution consists
of promoting the distributed tasks (initially time-sharing) to real-time.
Furthermore, in each workstation all the distributed tasks are put in the same
group of processes, which allows their execution to be controlled by means of
stop and resume signals. In this way, the CPU time is split into two different
periods, the parallel slice and the interactive one.

The implemented system, called DTS, which is an improved version of the
original described in [10], is composed of three types of modules: the
Scheduler, the Load and the Console. The Scheduler and the Load are composed of
distributed processes running on each active workstation. The function of each
Scheduler is the dynamic variation of the amount of CPU cycles exclusively
assigned to execute distributed tasks (PS: Parallel Slice), and the amount of
time assigned to local tasks (IS: Interactive Slice). The Load processes collect
the interactive load on every workstation. The Console can be executed from any
node of the NOW and is responsible for managing and controlling the system.
For notational convenience, the set of active nodes in the NOW is called the VM
(Virtual Machine) (see Figure 1).

Fig. 1. DTS environment.



Our environment is started automatically by the pvm environment. In each
workstation composing our VM, the pvmd shell script has been modified as
follows: the line "exec $PVM_ROOT/lib/pvmd3 $@" has been changed to
"exec $PVM_ROOT/lib/scheduler $@". This way, when the workstation is added to
(activated in) the virtual machine from the pvm console (even if it is the
first one), the Scheduler is executed.

3 Coscheduling Implementation 

In this section the coscheduling algorithm implemented over the DTS’s sche- 
duler daemon is explained. Furthermore, some improvements in communication 
performance of the system are presented with the addition of a synchronization 
algorithm of the distributed tasks. 



3.1 Coscheduling Algorithm 

The coscheduling algorithm is shown in Figure 2.

Scheduler
    set PRI(Scheduler) = ((max(rt_priority)) and SCHED_FIFO)
    fork&exec(Load)
    fork&exec(pvmd)
    set PRI(pvmd) = ((max(rt_priority) - 1) and SCHED_RR)
    set PRI(Load) = ((max(rt_priority) - 2) and SCHED_FIFO)
    set pvmd leader of pvm_tasks
sync_p:
    while (pvm_tasks) do
        sleep(PS)
        signal_stop(pvm_tasks)
        sleep(IS)
        signal_resume(pvm_tasks)
    end /*while*/

Fig. 2. Coscheduling algorithm



At the beginning of execution, the Scheduler, which has root permissions,
promotes itself to the Real Time class (it is initially time-shared). After
that, it forks and executes Load and pvmd (the pvm daemon), and promotes pvmd
and Load to Real Time priorities one and two levels below its own,
respectively. Next, the Scheduler sets pvmd to become the leader of a new group
of processes (denoted by pvm_tasks; this group will be composed of all the pvm
tasks that pvmd will create).

The scheduling policy of every process (SCHED_FIFO or SCHED_RR) is also shown
in the algorithm, and denotes FIFO or Round Robin scheduling respectively.
Scheduler and Load have a FIFO policy because they need to finish their work
completely before they release the CPU. On the other hand, pvmd can block
waiting for the receipt of an event, and meanwhile grant the CPU to another
process, perhaps at the same priority level (a pvm task). For this reason, its
scheduling policy has to be Round Robin.
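
The priority assignment described above can be sketched in Python as follows. This is only an illustration, not the DTS source: the helper names are hypothetical, and the maximum priority of 99 is an assumption (the usual value on Linux). The call that actually changes the scheduling class needs root permissions, so the sketch only builds the plan.

```python
import os

def plan_priorities(max_rt_priority):
    """Return {process: (policy_name, priority)} following the text above."""
    return {
        "scheduler": ("SCHED_FIFO", max_rt_priority),
        "pvmd": ("SCHED_RR", max_rt_priority - 1),
        "load": ("SCHED_FIFO", max_rt_priority - 2),
    }

def apply_priority(pid, policy_name, priority):
    """Promote one process to a real-time class (Linux only, needs root)."""
    policy = getattr(os, policy_name)  # e.g. os.SCHED_FIFO
    os.sched_setscheduler(pid, policy, os.sched_param(priority))

# 99 is assumed here; on a live Linux system the value can be queried with
# os.sched_get_priority_max(os.SCHED_FIFO).
plan = plan_priorities(99)
```

Keeping the plan separate from its application makes the ordering of the three priorities easy to check without touching the scheduler of the running machine.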

Following this, the Scheduler enters a loop where each iteration takes IP
(Iteration Period) ms, where IP = PS + IS. This loop stops when there are
no more pvm tasks (including the pvmd). This occurs when the workstation is
deleted from the pvm console.

Thus, after the Parallel Slice (PS), all the Schedulers in the VM stop all the
pvm tasks by sending a STOP signal to the group of PVM processes led by pvmd.
The tasks are resumed at the end of the interactive slice by sending them a
CONTINUE signal. Because the distributed tasks are running in the real-time
class, they have a higher global priority than the interactive ones. Thus, when
they are resumed, they take control of the CPU. In this way, one interval of
execution is assigned to distributed tasks and another (possibly different)
interval to interactive ones. Figure 3 shows the behavior of the DTS scheduler:
each time the Scheduler is executed (for an insignificant time), it concedes
the CPU alternately to the distributed (PS period) and to the interactive (IS
period) tasks.
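
A minimal Python sketch of this stop/resume loop is given below. It is an illustration under stated assumptions rather than the DTS code: the function name, the injectable `send` and `sleep` hooks and the group id 1234 are hypothetical, and SIGSTOP/SIGCONT are assumed to be the STOP and CONTINUE signals mentioned in the text.

```python
import os
import signal
import time

def coschedule(pgid, ps, is_, tasks_alive, send=os.killpg, sleep=time.sleep):
    """Alternate a parallel slice (ps) and an interactive slice (is_), in seconds.

    pgid        -- process group led by pvmd (all pvm tasks belong to it)
    tasks_alive -- callable returning False once no pvm task is left
    send, sleep -- injectable so the loop can be exercised without real tasks
    """
    while tasks_alive():
        sleep(ps)                    # distributed tasks own the CPU
        send(pgid, signal.SIGSTOP)   # end of the parallel slice
        sleep(is_)                   # local tasks run undisturbed
        send(pgid, signal.SIGCONT)   # real-time tasks take the CPU back

# Dry run with fake hooks: two iterations of a 90 ms / 10 ms split.
events = []
remaining = iter([True, True, False])
coschedule(
    pgid=1234,  # hypothetical process group id
    ps=0.09,
    is_=0.01,
    tasks_alive=lambda: next(remaining),
    send=lambda pgid, sig: events.append(sig),
    sleep=lambda seconds: events.append(seconds),
)
```

With the default hooks the same function would signal a real process group; the dry run only records the order of sleeps and signals.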




Fig. 3. DTS environment behavior (the figure marks one IP, i.e. one PS
followed by one IS)



3.2 Scheduler Synchronization 

Only the algorithm of each Scheduler that runs in the VM has been explained, 
but how are they synchronized to execute the parallel and the interactive slice at 
the same time and how are these slices modified according to the Load Average 
in the VM? Figure 4 shows the schematic algorithm that has been used to solve 
these two questions. 

The tasks composing a distributed application can have basically CPU or 
message passing intensive phases. In the first case, it is unnecessary to syn- 
chronize the tasks. On the other hand, in the second case, the synchronization 
between the communication tasks can increase the global performance of the 
distributed application [7]. For this reason, DTS has two different modes of ope- 
ration. In the Dynamic mode, the PS and IS periods are synchronized over all 
the distributed tasks, whereas in the Distributed mode, the CPU intensive tasks 
are not synchronized at all. 

Every Load Interval (LI), all the Load processes collect the real CPU queue
length ($q_i$). The work done by Ferrari [11] shows that the length of the ready
queue is a good index for measuring the load of a workstation. After N
intervals (N being the number of LI intervals of past history to take into
account), the Load Index, denoted as $Q_i$, is computed and a message containing
the load is sent to the Console. Exponential smoothing is used to compute the
Load Index, defined as follows:

$$Q_i = Q_{i-1} e^{-P} + q_i (1 - e^{-P}), \quad i \geq 1, \qquad Q_0 = 0 \tag{1}$$

where $Q_{i-1}$ is the last computed index, $q_i$ is the real CPU queue length
and $P$ is the smoothing parameter (its value is illegible in this copy).
Taking into account the studies done by Ferrari [11], an LI of 100 ms and an N
of 10 have been chosen.

Load_j: ∀ M_j ∈ VM
    i = 0, Q_{i-1} = 0
    Each LI interval do
        collect(q_i)
        collect(Net_Activity)
        compute(Q_i)
        if (++i mod(N) == 0)
            if (Net_Activity < Network_Threshold)
                set MODE_DTS = DISTRIBUTED
                compute(PS & IS)
                set PS & IS
            else
                set MODE_DTS = DYNAMIC
                if (|Q_i - Q_{i-1}| < Load_Threshold)
                    send(Console, Q_i)

Console: Node Master
    if (MODE_DTS == DYNAMIC)
        while (not timeout) do
            for each M_j ∈ VM async_receive(Q_i)
        compute(RLA, PS & IS)
        broadcast(PS & IS)

Scheduler_j: ∀ M_j ∈ VM
    if (MODE_DTS == DYNAMIC)
        async_receive(PS & IS)
        set PS & IS
        goto sync_p

Fig. 4. Synchronization Algorithm. Console is in the node master. There is a
Load and Scheduler module in each node of the VM
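
As a worked illustration of the exponential smoothing in equation (1), the following Python sketch computes the Load Index from a list of queue-length samples. The function name and the sample values are hypothetical, and the smoothing parameter is passed in explicitly; the value ln 2 is used only because it makes the arithmetic easy to follow.

```python
import math

def load_index(queue_lengths, p):
    """Exponentially smoothed load index Q_i for samples q_1..q_i, with Q_0 = 0."""
    decay = math.exp(-p)
    q_index = 0.0
    for q in queue_lengths:
        q_index = q_index * decay + q * (1.0 - decay)
    return q_index

# Hypothetical samples: a ready queue holding one process in every interval.
# p = ln 2 gives e^-p = 0.5, so the index moves halfway to the sample each time.
q1 = load_index([1], math.log(2))
q2 = load_index([1, 1], math.log(2))
```

With these inputs the index rises from about 0.5 to about 0.75, showing how older samples are progressively discounted.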

Note that when Load collects $q_i$, the distributed tasks are stopped, waiting
outside the ready queue. For this reason, the distributed tasks are not counted.
In other designs, for example systems where the priority of distributed tasks
is increased and decreased periodically, the need to distinguish between
distributed and interactive tasks adds a great overhead to the system.

In Figure 4 it is important to observe how DTS automatically activates either
the DISTRIBUTED or the DYNAMIC mode of operation in each workstation,
depending on the network activity (Net_Activity). The network activity is the
number of messages sent or received by the pvmd every N × LI interval. The
behavior of the nodes which operate in DISTRIBUTED or DYNAMIC mode is
explained separately below.
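
The mode decision can be written as a very small sketch. The function name, the threshold of 500 and the message counts are hypothetical; only the comparison itself comes from the algorithm of Figure 4.

```python
def choose_mode(net_activity, network_threshold):
    """Pick the DTS mode of one node from its pvmd message count."""
    if net_activity < network_threshold:
        return "DISTRIBUTED"  # little communication: no global synchronization
    return "DYNAMIC"          # communicating tasks: slices set by the Console

# Hypothetical message counts measured over one N x LI window.
modes = [choose_mode(count, 500) for count in (0, 120, 500, 1400)]
```

A count equal to the threshold selects the DYNAMIC mode, because the test in the figure is a strict "less than".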

A centralized algorithm has been implemented, basically because of the
performance requirements of the local applications. If the algorithm were
distributed, it would reduce the performance of the interactive tasks and
increase the network activity due to the high activity of the Load module,
which would, for example, send the load index of each node to the whole VM.



Dynamic Mode. On reception of the Load indexes from all the active nodes, or
after a timeout, the Console computes the Relative Load Average (RLA), a
metric used in DYNAMIC mode to fix the parallel and interactive slices on each
workstation. The RLA is defined as follows:

$$RLA = \frac{\sum_{j=1}^{NW} \frac{Q^j}{W_j(A)}}{NW} \tag{2}$$

where $Q^j$ is the load index of workstation j, NW the number of workstations in
the VM and $W_j(A)$ the power weight of workstation j. The power weight [8] is
defined as follows:

$$W_j(A) = \frac{S_j(A)}{\max_{k=1}^{NW}\{S_k(A)\}}, \quad j = 1 \ldots NW \tag{3}$$

where $S_j(A)$ is the speed of the workstation $M_j$ to solve an application of
size A on a dedicated system. Nevertheless, the experimental results in [8] show
that if applications fit in the main memory, the power weight differences,
using several applications, are insignificant.
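
Equations (2) and (3) can be checked with a short Python sketch. The function names, the three-node VM and the speed values (in arbitrary units) are hypothetical and serve only to make the computation concrete.

```python
def power_weights(speeds):
    """W_j(A) = S_j(A) / max_k S_k(A) for every workstation j."""
    fastest = max(speeds)
    return [s / fastest for s in speeds]

def relative_load_average(load_indexes, weights):
    """RLA = (sum over j of Q^j / W_j(A)) / NW."""
    scaled = [q / w for q, w in zip(load_indexes, weights)]
    return sum(scaled) / len(scaled)

# Hypothetical VM of three nodes: the second one is half as fast.
weights = power_weights([100.0, 50.0, 100.0])
rla = relative_load_average([1.0, 0.5, 1.0], weights)
```

Here the slower node carries half the load of the others, so after weighting every node contributes equally and the RLA is 1.0.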

Table 1, which shows the relation between the RLA, PS and IS, is used
to compute the PS and IS. The values of PS and IS shown in the table are
percentages of the IP period.

Table 1. Relation between RLA, PS and IS

RLA                   IS    PS
0 < RLA < 0.25        10    90
0.25 < RLA < 0.5      20    80
0.5 < RLA < 0.75      30    70
0.75 < RLA < 1        40    60
1 < RLA < 1.25        50    50
1.25 < RLA < 1.5      60    40
1.5 < RLA < 1.75      70    30
1.75 < RLA < 2        80    20
2 < RLA               90    10
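
The lookup in Table 1 can be expressed as a small function. This is a sketch under two assumptions that are not stated in the table: each row is read as a half-open interval (a value exactly on a boundary falls into the upper row), and the default IP of 100 ms is the value chosen later in the experimentation. The function name is hypothetical.

```python
def slices_from_load(load, ip_ms=100.0):
    """Return (PS, IS) in milliseconds for a load value, following Table 1.

    Assumption: each row is read as the half-open interval [low, high).
    """
    if load < 0:
        raise ValueError("load must be non-negative")
    row = min(int(load // 0.25), 8)  # rows 0..8 of Table 1
    is_percent = 10 * (row + 1)      # IS grows from 10% to 90%
    ps_percent = 100 - is_percent
    return ip_ms * ps_percent / 100, ip_ms * is_percent / 100

light = slices_from_load(0.1)
even = slices_from_load(1.0)
heavy = slices_from_load(2.5)
```

The three calls select the first, the middle and the last row of the table respectively.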



The Console sends PS and IS to all the Scheduler modules by a broadcast
message. Broadcast delivery has been chosen due to the high cost of multicasting
or sending a message separately to each node of the VM. On asynchronous
reception (done by a message handler), each Scheduler process sets its parallel
and interactive slices and jumps to a predetermined address, the label sync_p
(synchronization point) in the algorithm of Figure 2.



Distributed Mode. The workstations executing in this mode do not exchange
many messages; there may even be no communication at all. Therefore,
synchronization is not required. The only factor to take into account is the
efficient sharing of the CPU between the distributed and local tasks, so each
workstation sets the PS and IS slices according to its own sequential workload.
Each node also computes these values according to Table 1, but using $Q_i$ (the
Load Index) in place of the RLA.




Fig. 5. benchmarks: sinring (a) and sintree (b) 



4 Experimentation 

Our experimental environment is composed of eight 350 MHz Pentium workstations
with 128 MB of memory and 512 KB of cache. All of them are connected through an
Ethernet network with 100 Mbps of bandwidth and a minimal latency in the order
of 0.1 ms. All our parallel experimentation was carried out in a non-dedicated
environment with an owner workload (defined as the averaged ready queue length)
of 0.5 (Light), 1 (Medium) and 2 (Heavy). Workload characterization was carried
out by running a variable number of background processes representative of the
typical programs of personal workstations. The performance of the coscheduling
implementation was evaluated by running three kernel benchmarks from the NAS
parallel benchmarks suite [12]: ep, an embarrassingly parallel application; is,
a parallel integer sorting application; and mg, a parallel application that
solves the Poisson problem using multi-grid iterations. Also, two synthetic
applications, sinring and sintree, were implemented, representative of two
types of communication patterns. The first implements a logical ring (see
Figure 5(a)), and the second models one-to-many and many-to-one communication
(see Figure 5(b)). In both applications, every node executes different
processes sequentially during a fixed period of time, which is an input
argument (1 ms by default). The number of iterations of both problems is also
an input argument.

Table 2. Results obtained with PVM version 3.4.0

                                      PVM (sec)
Bench.     Problem Size         Light   Medium   Heavy
ep         2^5                  202     244      694
is         2^? [0..2^?]         126     149      253
mg         256x256x256          645     1336     1976
sintree    20000 iterations     92      160      210
sinring    80000 iterations     203     255      377

Table 2 shows the benchmarks parameters used in the experimentation and 
the execution times obtained with PVM. Two different performance indexes are 
used: 



— Gain (G): This metric was used for evaluating the performance of our DTS 
system with respect to the original PVM. The gain is defined as follows: 



$$G = \frac{T_{pvm}}{T_{sched}} \tag{4}$$

where $T_{pvm}$ (in seconds) is the execution time of one application in the
original PVM environment and $T_{sched}$ (in seconds) is the execution time of
the same application in the DTS system.

— Local Overhead (LO): This metric was used to quantify the local workload 
overhead introduced by the execution of the parallel applications. One of 
the scripts used for the simulation of the Heavy workload (a compilation 
process) was taken as a reference. The LO metric is defined as follows: 



$$LO = \frac{ET_{nondedicated} - ET_{dedicated}}{ET_{dedicated}} \tag{5}$$

where $ET_{dedicated}$ is the execution time spent by the compilation process
in a dedicated workstation (86 s), and $ET_{nondedicated}$ is the execution time
obtained by executing the script together with parallel applications.
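
The two performance indexes reduce to simple ratios, as the following Python sketch shows. The function names are hypothetical. Of the input values, 694 s (ep, Heavy) and 86 s (dedicated compilation) come from the text, while the DTS time of 347 s and the non-dedicated time of 129 s are invented purely for illustration and are not measured results.

```python
def gain(t_pvm, t_sched):
    """G = T_pvm / T_sched; values above 1 mean DTS ran the application faster."""
    return t_pvm / t_sched

def local_overhead(et_nondedicated, et_dedicated):
    """LO = (ET_nondedicated - ET_dedicated) / ET_dedicated."""
    return (et_nondedicated - et_dedicated) / et_dedicated

g = gain(694.0, 347.0)             # hypothetical DTS execution time
lo = local_overhead(129.0, 86.0)   # hypothetical non-dedicated time
```

With these made-up inputs the gain is 2.0 and the local overhead is 0.5, i.e. the compilation would take 50% longer than on a dedicated workstation.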



4.1 Network Threshold 

As the algorithm of Fig. 4 shows, depending on the value of the network thresh- 
old, DTS activates the Dynamic or the Distributed mode of operation. In this 
section, the best threshold value has been studied. 

The results shown in Fig. 6 were obtained with our synthetic benchmark,
sinring, whose network and CPU activity was under our control, since the
sequential time of every process is an input argument of the benchmark. The
experimentation was carried out with a medium owner workload.

The network activity of every node, Net_Activity, was obtained by measuring
the average number of packets received and delivered by the pvm daemon. The
DTS gain, calculated in accordance with formula (4), was obtained with our
benchmark for two different values of the network activity while the threshold
was varied between 0 and 1500 calls per second. As expected, the results
depend on the value of the threshold and the network activity of the benchmark.
For a low-communication benchmark a high threshold is better, whereas for a
high-communication benchmark a low threshold is better. Taking into account
the above results, we have implemented DTS with a variable network threshold
defined by the user.

Fig. 6. DTS Gain according to the network threshold value (axis: Network
Threshold)

4.2 DTS Performance 

Fig. 7(left) shows the gain obtained in the execution of the three NAS bench- 
marks when they are executed in the DTS environment and with two different 
values of the IP period, 100 ms and 1000 ms. Fig. 7(right) shows the local work- 
load overhead in the execution of a compilation script when it is executed to- 
gether with different parallel benchmarks. The behavior of the DTS environment 
can be determined comparing these two figures. 

First of all, it is important to determine the IP value. An IP of 1000 ms 
increases the local overhead of interactive local tasks (see Fig. 7(right)) even if 
there is a light distributed workload. On the other hand, a value of less than 100 
ms increases the overhead produced by the addition of context switches. Taking 
these two considerations into account, an IP value of 100 ms was chosen and it 
is used in the rest of the article. 

From Fig. 7(left), it can be observed that in the ep case, a computing intensive 
benchmark, the results obtained in DTS with a light load are worse than in the 
PVM due to the additional overhead introduced by the DTS. On the other hand, 
since the DTS assigns to ep a better percentage of time than the PVM in the 
heavy case, the results are better. 

Globally, for high message passing applications, is and mg, better results
were obtained due to the synchronization between communicating periods.

Fig. 7. (left) Scheduler results. L: Light, M: Medium and H: Heavy owner
workload. (right) Local workload overhead

In general, the DTS environment gains with respect to the original PVM in
the cases when there is at least some local load. Obviously, this gain is at
the expense of the local overhead introduced by the DTS environment, as shown
in Fig. 7(right), which, with the exception of the computing intensive
distributed tasks, is very low and is even reduced in the is case.



4.3 Local Load Measurements 

The performance of the distributed tasks can be improved at the expense of a
moderate local overhead, but how is the response time of the local applications
affected?

Table 3. Response time measurements (in µs). In the DTS case, IP = 100 ms

                   PVM                     DTS
Bench.      average    max          average    max
sinring     6.13       201          5.93       247
sintree     6.01       454          5.94       477
ep          15.65      18043.7      29.81      20037.5



To answer this question, we measured the average response time of a local
benchmark that we implemented, executed jointly with one of the distributed
benchmarks. The local benchmark continuously obtains the status of the standard
output for printing by means of the select system call. The distributed
benchmarks used were sintree and sinring (message-passing intensive) and ep
(CPU bound). Table 3 shows the average and the maximum (max) response time in
microseconds of the three benchmarks, obtained in the two environments.

Table 3 shows that in no case does the DTS system increase the response time
of the local benchmark excessively. The increase is only slightly significant
in the case of ep, but this response time overhead was not noticed by the local
user. Approximately 98% of the collected samples were 5 or 6 µs (the response
time of the select system call under normal conditions), and only 2% took
higher values (due to the execution of the distributed benchmark), which
increases the average significantly. We can conclude that with an IP of 100 ms
the overhead added by the distributed tasks does not damage the response time
of local applications excessively.
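
The effect of a few slow samples on the average can be illustrated with made-up numbers. The sample values below are hypothetical (they are not the measured data): 98 samples at 6 µs and 2 samples at 1200 µs.

```python
from statistics import mean, median

# Hypothetical response times in microseconds: 98% fast, 2% delayed.
samples = [6.0] * 98 + [1200.0] * 2

avg = mean(samples)    # pulled upwards by the two delayed samples
mid = median(samples)  # what the local user experiences almost always
```

The average comes out near 30 µs although the median stays at 6 µs, which is consistent with the observation that a rise in the average need not be perceptible to the local user.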

4.4 Coscheduling Skew 

Coscheduling is implemented by broadcasting a short message to all the nodes in
the cluster. The PS and IS intervals must be synchronized between every pair of
nodes but, due to the synchronization algorithm, some skew always exists
between them. The coscheduling skew ($\delta$) is the maximum phase difference
between two arbitrary nodes, formally:

$$\delta = \max(broadcast) - \min(broadcast) \tag{6}$$

where max(broadcast) and min(broadcast) are the maximum and minimum times taken
to send a short broadcast message. We have measured a coscheduling skew
($\delta$) of 0.1 ms with the aid of the lmbench [13] benchmark. This value is
insignificant in relation to the 100 ms of the IP interval. Thus, the
coscheduling skew has no significant influence on the performance of the DTS
system.
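
Equation (6) and the comparison with the IP can be reproduced with a few lines of Python. The broadcast delivery times below are hypothetical values chosen so that the skew matches the 0.1 ms reported in the text; the function name is also an assumption.

```python
def coscheduling_skew(broadcast_times_ms):
    """Skew = slowest minus fastest delivery of the same short broadcast."""
    return max(broadcast_times_ms) - min(broadcast_times_ms)

# Hypothetical delivery times (ms) of one broadcast at three nodes.
skew = coscheduling_skew([0.20, 0.25, 0.30])
fraction_of_ip = skew / 100.0  # IP = 100 ms, as chosen in the article
```

A skew of about 0.1 ms is roughly a thousandth of the iteration period, which is why it can be neglected.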

4.5 Context Switches 

With the help of a program we implemented (called getcontext), the context
switch cost was measured in each workstation as a function of the process size.
The work done by getcontext was simulated as the summing up of a large array
before passing on one token (a short message) to the next process. The
processes (a variable number) were connected in a ring of Unix pipes. The
summing was an unrolled loop of about 2.7 thousand instructions. The effect was
that both the data and the instruction caches were polluted to some extent
before the token was passed on. When the size of a benchmark was passed to
getcontext as an argument, it computed the context switch cost of that
benchmark.

Table 4 shows the sizes of the measured benchmarks, ep and sinring, and their
corresponding context switch costs. The benchmarks that fit in the main memory
(the sum of their Resident Set Size and the memory of the o.s. applications <
128 Mbytes) were chosen, and those that exceed it (is and mg) were discarded.
This solution was adopted to avoid the extra overhead added by the swapping
latency for large applications, which may cause inaccuracies in our
measurements. Also, the number of context switches obtained with the ep and
sinring benchmarks in the two environments (original PVM and DTS (IP = 100 ms)
with medium owner workload) was measured. The results are shown in Figure 8,
where it can be seen that in the case of the benchmark with high communication
(sinring), the number of context switches is lower in the DTS case due to the
synchronization between distributed tasks. In the case of ep, the number of
context switches is increased by the DTS environment because ep must release
the CPU in each PS period. Thus, the DTS system helps only the message passing
applications.

Table 4. Sizes of the benchmarks: VI (Virtual Image: text + data + stack) and RSS
(Resident Set Size) in Kbytes. Context switch costs (in µs) of one instance of the
benchmark when one copy (1 p.) or two (2 p.) were executed

Bench.      VI     RSS     1 p.    2 p.
ep          ?      596     ?       ?
sinring     ?      592     ?       ?



Fig. 8. ep and sinring context switches (panels: ep, sinring; x-axis: Time (x100 ms))



5 Conclusions and Future Work 

The DTS environment, which implements explicit coscheduling of distributed
tasks in a non-dedicated NOW, has been introduced. By studying the
communication architecture of the distributed applications and by enabling
each node of the system to change its configuration dynamically (dynamic and
distributed), the communication performance of distributed applications was
improved without excessively damaging the performance of local ones.
Distributed tasks with high CPU demand normally showed no improvement, but in
these cases the overhead added to local tasks is not very significant.

We are interested in developing more efficient methods for synchronization, 
thus decreasing the overhead introduced in the coscheduling of distributed tasks. 

In the future, we are interested in extending the facilities of the DTS
environment. The most important goal is to give the DTS environment the
ability to run, manage and schedule more than one distributed application, and
to modify the scheduling algorithm according to this new functionality.

The dynamic mode algorithm of the DTS environment has a centralized nature.
Furthermore, the Console and the Master scheduler are located in one module.
If the Master module fails, the system goes down and there is no possibility
of recovering the work done. Even if one node (not the Master) fails, the
distributed application will stop abnormally and its execution will be
invalidated, perhaps after a long period of time. The solution is to provide
the DTS environment with fully distributed behavior. To accomplish these
objectives we have to study and propose new algorithms to implement fault
tolerance.

References 

1. Anderson, T., Culler, D., Patterson, D. and the Now team: A case for NOW (Net- 
works of Workstations). IEEE Micro (1995) 54-64 

2. Litzkow, M., Livny, M., Mutka, M.: Condor - A Hunter of Idle Workstations. 
Proceedings of the 8th Int’l Conference of Distributed Computing Systems (1988) 
104-111 

3. Russ, S., Robinson, J., Flachs, B., Heckel, B.: The Hector Distributed Run-Time
Environment. IEEE Trans. on Parallel and Distributed Systems, Vol. 9 (11), (1998)

4. Du, X., Zhang, X.: Coordinating Parallel Processes on Networks of Workstations. 
Journal of Parallel and Distributed Computing (1997) 

5. Crovella, M. et al.: Multiprogramming on Multiprocessors. Proceedings of 3rd
IEEE Symposium on Parallel and Distributed Processing (1994) 590-597

6. Dusseau, A., Arpaci, R., Culler, D.: Effective Distributed Scheduling of Parallel 
Workloads. ACM SIGMETRICS’96 (1996) 

7. Arpaci, R., Dusseau, A., Vahdat, A., Liu, L., Anderson, T., Patterson, D.: The 
Interaction of Parallel and Sequential Workloads on a Network of Workstations. 
ACM SIGMETRICS’95 (1995) 

8. Zhang, X., Yan, Y.: Modeling and Characterizing Parallel Computing Performance
on Heterogeneous Networks of Workstations. Proc. Seventh IEEE Symp. Parallel
and Distributed Processing (1995) 25-34

9. Ousterhout, J.: Scheduling Techniques for Concurrent Systems. Third International 
Conference on Distributed Computing Systems (1982) 22-30 

10. Solsona, F., Gine, F., Hernandez, P., Luque, E.: Synchronization Methods in
Distributed Processing. Proceedings of the Seventeenth IASTED International
Conference. Applied Informatics (1999) 471-473

11. Ferrari, D., Zhou, S.: An Empirical Investigation of Load Indices for Load Balanc- 
ing Applications. Proc. Performance ’87, 12th Int’l Symp. Computer Performance 
Modeling, Measurement, and Evaluation. North-Holland, Amsterdam (1987) 515- 
528 

12. Parkbench Committee: Parkbench 2.0. http://www.netlib.org/park-bench (1996)

13. McVoy, L., Staelin, C.: lmbench: Portable tools for performance analysis. Silicon
Graphics, Inc., ftp://ftp.sgi.com/pub/lm-bench.tgz (1997)




Enhancing Parallel Multimedia Servers through 
New Hierarchical Disk Scheduling Algorithms* 



Javier Fernandez, Felix Garcia, and Jesus Carretero 

Dto. de Informatica, Universidad Carlos III de Madrid, 
Avda. Universidad 30, Leganes, 28991, Madrid, Spain,
{jfernand, fgarcia, jcarrete}@arcos.inf.uc3m.es



Abstract. An integrated storage platform for open systems should be
able to meet the requirements of deterministic applications, multimedia
systems, and traditional best-effort applications. It should also
provide a disk scheduling mechanism that fits all those types of
applications. In this paper, we propose a three-level hierarchical disk
scheduling scheme with three main components: metascheduler, single
server scheduler, and disk scheduler. The metascheduler provides
scheduling mechanisms for a parallel disk system or a set of parallel
servers. The server level is divided into three main queues:
deterministic, statistic, and best-effort requests. Each server may have
its own scheduling algorithm. The lower level, the disk driver, chooses
the ready streams using its own scheduling criteria. These systems
have been implemented and tested, and the performance evaluations
demonstrate that our scheduling architecture is adequate for handling
stream sets with different timing and bandwidth requirements.



1 Introduction 

In recent years, there has been great interest in the scheduling of I/O
devices, usually disks, in computer systems [16,9]. However, the requirements
and the platforms for multimedia and general purpose systems seemed to be so
different [7] that people developed specialized systems. Thus, disk scheduling
algorithms for general purpose systems tried to reduce the access time, while
multimedia systems tended to satisfy the real-time constraints of cyclical
streams, even at the cost of performance. With the growth of multimedia
applications, some authors [15] have proposed the design of a new kind of
system, named integrated, that includes facilities to support heterogeneous
multimedia and general purpose information. In an integrated system, the user
may request the start of a new I/O request (stream) at run-time. The system
must determine, following some admission criteria, whether the stream is
schedulable, and admit or reject it accordingly. The main problem with most of
those systems is that they do not provide schedulability guarantees for
deterministic applications [6].

* This work has been supported in part by the NSF Award CCR-9357840, the contract
DABT63-94-C-0049 from DARPA and by the Spanish CICYT under the project
TIC97-0955

In this paper we describe a hierarchical three-level disk scheduling scheme,
which is currently used in MiPFS, a multimedia integrated parallel file system
[3]. The scheduler has three main components: metascheduler, single server
scheduler, and disk scheduler. The metascheduler provides scheduling
mechanisms for a parallel disk system or a set of parallel servers. The server
level of the architecture is divided into three main queues: deterministic,
statistic, and best-effort requests. Each server may have its own scheduling
algorithm. The lower level, the disk driver, chooses the ready streams using
its own scheduling criteria. We also propose an adaptive admission control
algorithm relying on worst and average values of disk server utilization. Only
streams satisfying the admission algorithm criteria [2] are accepted for
further processing by the disk server. Section 2 presents some related work.
Section 3 describes the multi-level architecture of our scheduling scheme,
including the extension to a parallel disk system. Section 4 presents some
performance evaluations of our scheduling architecture, which was first
simulated, and then implemented and tested. Finally, section 5 presents some
concluding remarks and future work.



2 Related Work 

There are well-known techniques to schedule deterministic real-time tasks
[14,13]. However, most of them are oriented to fixed priority scheduling, or
at most to dynamic priority scheduling in very specific situations. In I/O
devices, moreover, the timing requirements of many streams are not known in
advance, so a global scheduling analysis cannot be done [12]. The priority of
the streams must be computed dynamically, taking into account parameters such
as time, disk geometry, and the quality of service granted to the application.
Several algorithms have been proposed to satisfy these timing constraints in
multimedia systems: EDF gives the highest priority to the I/O streams with the
nearest deadline [19], SCAN-EDF orders the I/O streams by deadline and the
streams with the same deadline by ascending track order [16], and SCAN-RT
orders the streams in ascending track order but takes the deadlines into
consideration [11,5]. However, none of them addresses the problem of
integrated environments, where multiple types of streams must be scheduled. An
integrated scheduling scheme should try to reduce the response time of
best-effort requests, but it should also provide a time-guided disk scheduler
that gives higher priority to the real-time streams with the nearest deadline.
Because most of the disk scheduling schemes proposed up to now can hardly make
a good trade-off between these two aspects, some multi-level hierarchical
scheduling architectures have been proposed for deterministic tasks in an open
environment [6], and for integrated multimedia systems [15]. In essence, those
schemes create a slower virtual device for each stream or for each class of
streams.
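
The difference between EDF and SCAN-EDF described above comes down to the sort
key. A minimal Python sketch follows. The Request tuple and its field names
are hypothetical, and a real SCAN-EDF implementation would also take the
current head position and scan direction into account.

```python
from collections import namedtuple

# Hypothetical request record: a name, a deadline in seconds, and a track number.
Request = namedtuple("Request", ["name", "deadline", "track"])


def edf_order(requests):
    """EDF: earliest deadline first; ties keep their arrival order."""
    return sorted(requests, key=lambda r: r.deadline)


def scan_edf_order(requests):
    """SCAN-EDF: earliest deadline first; ties are served in ascending track order."""
    return sorted(requests, key=lambda r: (r.deadline, r.track))


pending = [Request("a", 2.0, 50), Request("b", 1.0, 90), Request("c", 1.0, 10)]
print([r.name for r in edf_order(pending)])       # ['b', 'c', 'a']
print([r.name for r in scan_edf_order(pending)])  # ['c', 'b', 'a']
```

Requests b and c share a deadline, so only SCAN-EDF reorders them by track,
which shortens the seek between them.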

We have implemented our scheduling architecture, described in the next
section, on a multiprocessor running LINUX [1,18]. We chose LINUX for several
reasons: it is free, it is actively updated, and new modules can be added to
the OS without repeatedly modifying the kernel. The last feature is very
useful for testing different scheduling policies easily.

3 Parallel Disk Scheduler Architecture 

This section describes the scheduling architecture proposed for MiPFS. It is
named 2-Q [4] and it has been designed to handle both multimedia and
high-performance applications, trying to maximize disk bandwidth while meeting
I/O deadlines for multimedia and real-time I/O requests. The scheduler
architecture is shown in figure 1 and is divided into two levels: single
server and parallel scheduler. The architecture for a single server is
described first. Then, it is scaled up to the parallel scheduler architecture
(metascheduler). Both components are described below.



Fig. 1. Parallel Disk Scheduling System Architecture (labelled components: Admission Control, Load Distribution, QoS Control, Meta Scheduler)



3.1 Single Server Architecture 

Each server scheduler consists of two levels: an upper level with three
stream server schedulers (for deterministic, statistic, and best-effort
streams), and a lower level including a disk driver scheduler D, a ready queue
R, and a disk. Thus, each scheduling decision involves two steps. First, each
server scheduler receives streams and, based on its particular scheduling
algorithm, inserts them into the corresponding place in its queue. Second,
when the disk is free, D chooses one stream among the upper level queues,
using its own scheduling algorithm, and puts it into R. Statistic real-time
streams tolerate a certain percentage of deadline misses: as long as the
quality of service (QoS) of the client is met, the service is deemed
acceptable. Deterministic real-time streams do not tolerate any misses. In our
prototype, the disk scheduling algorithms used at the server level are EDF for
deterministic, SCAN-EDF for statistic, and SCAN for best-effort streams. The
disk driver scheduler D chooses the stream to be executed next based on the
dynamic priorities previously computed. In our hierarchical system, each
server queue has a different priority that D uses as a criterion to choose
ready jobs. However, using only this priority criterion would be unfair to
best-effort and statistic applications, leading to a high percentage of
deadline misses for statistic streams and to long delays for best-effort
streams, which have no deadline.

The intuition behind our policy is that if the density of real-time streams
is high, more disk serving time should go to real-time streams. Otherwise,
best-effort requests can be served, because the deadlines of the currently
most urgent real-time streams are not likely to be missed even if the disk
server serves some best-effort requests first. As the results in section 4
corroborate, our scheduling architecture has two major advantages over other
disk scheduling schemes. First, the deadlines of deterministic streams can
always be met for admitted streams. Second, the average waiting time for
best-effort requests is small.

The algorithms in figures 2 and 3 are used to compute the deadlines of the
streams and to insert the streams into the server queues. Two major parameters
are used in our scheduling scheme: deadline and service time. The service time
is totally application dependent because it depends on the track number, while
the deadline of a stream may be modified depending on the stream properties.
Two kinds of deadlines are considered for a specific stream: the application
deadline, $d$, which is the one set by the application through the driver
interface or other operations, such as QoS negotiation; and the scheduling
deadline, $l$, which is a virtual deadline used internally by the disk
scheduler and computed by the server's scheduler, so that $l \le d$. The
computation of the virtual deadline $l$ is different for each kind of stream.
For a best-effort request, $l$ is initially set to a very large value that can
be modified dynamically. For a statistic stream, $l$ is the same as its actual
deadline $d$. A dynamic priority is then computed, based on the former
parameters, and assigned to the request.

The deadline for a deterministic request $r$ is computed using the algorithm in
figure 2. The complexity of the scheduling deadline computation is $O(n)$,
where $n$ is the length of the deterministic server queue. For example, let's
assume five deterministic streams in $S_0$ with the values of $l$ and $d$
shown in Table 1. Let's also assume that the largest serving time for all of
them is $t_s = 0.5$ sec. Each $l_i$ is initially calculated as $d_i - 0.5$.
Since the $l_3$ of Request 3 cannot be satisfied by $l_2$
($5.5 + 0.5 > 5.5$), $l_2$ is decreased by subtracting $t_s$ so that the
scheduling deadline of Request 3 can be satisfied ($5.0 + 0.5 \le 5.5$). For
the same reason, $l_4$ must also be decreased to satisfy $l_5$. The final
scheduling deadlines are also shown in Table 1.

1. Set the tentative scheduling deadline as $l_r = d_r - x_{max}$, in which
   $d_r$ is the real deadline of stream $r$, and $x_{max}$ is the estimated
   maximum service time for a deterministic request.
2. Insert $r$ into the deterministic queue $S_0$ based on $l_r$.
3. If a previous stream $r-1$ exists, compute the service time for $r$ with
   seek time relative to $r-1$:
       $x_r = t_{seek} + t_{lat} + t_{tran}$
4. Check whether the scheduling deadline of the next stream, $l_{r+1}$, if it
   exists in $S_0$, may be violated (i.e., $l_r + x_r > l_{r+1}$).
   If so, decrease $l_r$ to $l_{r+1} - x_r$ to satisfy $l_{r+1}$.
5. Compute the service time for $r-1$, if it exists, with seek time relative
   to $r-2$ by the formula shown in step 3.
6. While $(l_r < l_{r-1} + x_{r-1}) \wedge (r-1 \ne NULL)$ do
   (a) Decrease $l_{r-1}$ to $l_r - x_{r-1}$ to satisfy $l_r$.
   (b) Make $r = r-1$ and $r-1$ be the previous stream $r-2$ in $S_0$.
   (c) Compute $x_{r-1}$ with seek time relative to $r-2$.
7. If $((r-1 \ne NULL) \vee (t + W < l_r))$ then SUCCESS
   else FAILURE,
   where $t$ is the current time, and $W$ is the service time for
   the non-preemptable stream being served.

Fig. 2. Computing the Scheduling Deadline of a Deterministic Request

Whenever the disk server is free, the driver scheduler D selects a stream from
the three server queues and inserts it into the ready queue. The algorithm
used by D to choose a ready stream from the server queues, at any time $t$, is
shown in figure 3.



Request                         Request 1  Request 2  Request 3  Request 4  Request 5
Real Deadline                      5.0        6.0        6.0        7.0        7.0
Original Scheduling Deadline       4.5        5.5        5.5        6.5        6.5
Final Scheduling Deadline          4.5        5.0        5.5        6.0        6.5

Table 1. Deadlines of Deterministic Requests
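
The numbers in Table 1 can be reproduced with a short sketch. The Python code
below is a simplified batch version of the procedure in figure 2, written
under two assumptions that are not in the original: the queue is already
sorted by deadline, and every request has the same service time (0.5 s in the
example), whereas figure 2 recomputes the service time from the seek distance
and handles one insertion at a time. The function names are hypothetical.

```python
def scheduling_deadlines(real_deadlines, service_time):
    """Return (tentative, final) scheduling deadlines for a deadline-sorted queue.

    Assumes one constant service time for all requests.
    """
    tentative = [d - service_time for d in real_deadlines]
    final = list(tentative)
    # Walk from the tail to the head: request i must finish early enough
    # that request i + 1 can still be served before its own deadline.
    for i in range(len(final) - 2, -1, -1):
        latest = final[i + 1] - service_time
        if final[i] > latest:
            final[i] = latest
    return tentative, final


def admissible(final, now, busy_time):
    """Loosely mirrors the last test of figure 2: the head of the queue must
    still be reachable after the stream currently being served finishes."""
    return not final or now + busy_time < final[0]


tentative, final = scheduling_deadlines([5.0, 6.0, 6.0, 7.0, 7.0], 0.5)
print(tentative)  # [4.5, 5.5, 5.5, 6.5, 6.5]
print(final)      # [4.5, 5.0, 5.5, 6.0, 6.5]
```

The two printed lists match the "Original" and "Final" rows of Table 1.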



Obviously, scheduling deterministic streams with high priority again means
that some statistic streams would miss their deadlines. An alternative
approach can be used to reduce the number of missed deadlines: temporary
degradation of the QoS of some streams. Some statistic applications, such as
videoconferencing, can usually degrade their QoS temporarily to benefit
deterministic applications, such as telesurgery. To use this scheme, each
stream should specify the average QoS required and the percentage of temporary
degradation tolerated during a maximum time. Then, the priority of its
requests can be reduced to satisfy the new requirements.

1. if $(length(S_0) \ne 0 \vee length(S_1) \ne 0)$ Choose the stream $r$ with the
   smaller deadline from $S_0$ and $S_1$, the two real-time queues to be served.
2. Compute the actual service time of the real-time candidate,
       $x_r = t_{r,seek} + t_{r,lat} + t_{r,tran}$
3. if $((t + x_r) < l_r) \vee ((t + x_r) < d_r)$ candidate = $r$
   else discard $r$ and notify failure.
4. if $\beta x_r < l_r - t$, where $\beta > 1$ is a "maximum" slack factor,
   (a) Search $S_2$ to find $nr$ such that, $h$ being the current disk head
       position and $p_i$ the track streamed by $i$,
       $|p_{nr} - h| = \min_{i \in S_2} |p_i - h|$.
   (b) Compute the actual service time of $nr$,
       $x_{nr} = t_{nr,seek} + t_{nr,lat} + t_{nr,tran}$
   (c) Compute $t_{slot} = \alpha (x_r + x_{nr})$, where $\alpha > 1$ is a
       "minimum" slack factor.
   (d) if $t_{slot} < l_r - t$ candidate = $nr$
       else candidate = $r$
5. Insert candidate into the ready queue of D

Fig. 3. Algorithm to Choose a Ready Task from the Server Queues
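
The selection procedure of figure 3 can be sketched in Python as follows. This
is an illustration under stated assumptions, not the MiPFS driver: the Stream
class, the linear service_time cost model and its constants, the default slack
factors, and the behavior when both real-time queues are empty are all
hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class Stream:
    name: str
    track: int
    sched_deadline: float  # scheduling deadline l
    real_deadline: float   # application deadline d


def service_time(stream, head):
    """Hypothetical cost model: seek proportional to distance, plus fixed
    rotational latency and transfer time (all in seconds)."""
    return 0.001 * abs(stream.track - head) + 0.004 + 0.010


def choose_ready(s0, s1, s2, now, head, alpha=1.2, beta=2.0):
    """Pick the next stream from the deterministic (s0), statistic (s1) and
    best-effort (s2) queues, following the steps of figure 3."""
    def nearest(queue):
        return min(queue, key=lambda s: abs(s.track - head))

    realtime = s0 + s1
    if not realtime:
        # Not specified in figure 3; assumed here: serve the nearest best-effort.
        return nearest(s2) if s2 else None

    r = min(realtime, key=lambda s: s.sched_deadline)            # step 1
    x_r = service_time(r, head)                                   # step 2
    if not (now + x_r < r.sched_deadline or now + x_r < r.real_deadline):
        raise RuntimeError(f"stream {r.name} cannot meet its deadline")  # step 3

    candidate = r
    if s2 and beta * x_r < r.sched_deadline - now:                # step 4
        nr = nearest(s2)                                          # 4(a)
        t_slot = alpha * (x_r + service_time(nr, head))           # 4(b), 4(c)
        if t_slot < r.sched_deadline - now:                       # 4(d)
            candidate = nr
    return candidate                                              # step 5
```

With a distant real-time deadline the best-effort request is served first;
once the slack falls below the factor beta, the real-time stream wins.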

3.2 Metascheduler Architecture 

The meta-scheduler (Figure 1) is a coordinator with three functions:
decomposing the incoming streams and dispatching them to the corresponding
disk servers; gathering portions of data from different disk servers and
sending them back to the applications; and coordinating the service to a
specific stream among different disk servers. The first two functions are
included in the parallel disk system interface, while the third one is
internal and acts as an intelligent inspector among different disk servers.
When a new stream arrives, it is decomposed into substreams by the
meta-scheduler and each substream is sent to the appropriate disk server. The
meta-scheduler gathers the information from the servers and returns to the
application the status of the admission tests: successful if all disk servers
admitted the new stream, failed otherwise. When successful, the meta-scheduler
asks the disk servers to commit the resources. A problem with stream
distribution is that some substreams could succeed while others fail. The
meta-scheduler gathers the status of the substreams and, if there are deadline
misses and the QoS falls below the required level, notifies a failure to the
application. Moreover, it tells the remaining servers involved to abort the
stream and to release the budget allocated to it. To accomplish this, each
stream is assigned a unique id number, which is shared by all of its
substreams and inserted in a stream dispatching table. Whenever a disk server
fails to serve a substream, the meta-scheduler is informed. Using the unique
id number, the meta-scheduler changes the status of all the substreams of this
stream to failed and informs the other disk servers of this situation. As a
result, all the pending substreams of this stream are deleted from the queue
in each disk server, and the resources are freed. This policy avoids wasting
resources on failed streams and transfers those resources to other successful
streams.
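
The all-or-nothing admission and the stream dispatching table described above
can be sketched as follows. This Python example is an assumption-laden
illustration: the class and method names, the single numeric "budget" per
server, and the merging of the reserve and commit phases into one admit call
are simplifications that do not come from the original system.

```python
class DiskServer:
    """Hypothetical disk server with a fixed budget (e.g. a utilization bound)."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.reserved = {}  # stream id -> budget held for that stream

    def used(self):
        return sum(self.reserved.values())

    def admit(self, stream_id, load):
        """Admission test: reserve the budget only if it fits."""
        if self.used() + load > self.capacity:
            return False
        self.reserved[stream_id] = load
        return True

    def release(self, stream_id):
        self.reserved.pop(stream_id, None)


class MetaScheduler:
    def __init__(self, servers):
        self.servers = servers
        self.dispatch_table = {}  # stream id -> "admitted" or "failed"
        self.next_id = 0

    def submit(self, loads):
        """loads[i] is the budget that the substream needs on servers[i]."""
        stream_id = self.next_id
        self.next_id += 1
        results = [srv.admit(stream_id, load)
                   for srv, load in zip(self.servers, loads)]
        if all(results):
            self.dispatch_table[stream_id] = "admitted"
        else:
            self.fail(stream_id)  # roll back the substreams that did fit
        return stream_id, self.dispatch_table[stream_id]

    def fail(self, stream_id):
        """Mark the whole stream failed and free its budget on every server."""
        self.dispatch_table[stream_id] = "failed"
        for srv in self.servers:
            srv.release(stream_id)
```

A stream rejected by one server therefore leaves no reservation behind on the
others, which is the resource-saving effect the text describes.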

4 Performance Evaluation 

The performance results presented in this section were collected on a Silicon
Graphics Origin (SGO) machine located at the CPDC of Northwestern University,
and on a Pentium biprocessor with four disks, located at the DATSI of
Universidad Politécnica de Madrid. The SGO was used to test the scalability of
our solution on a parallel system including several servers. To test the
performance and behavior of our solution, a video server and several remote
clients were used. The video server transmitted several movies in response to
client requests. The duration of each movie was approximately 30 minutes.




Fig. 4. Disk bandwidth with different scheduling policies (x-axis: Number of Processes; series: FIFO, SCAN, CSCAN, 2-Q)



To test the features of the scheduler, with different scheduling policies, four 
parameters were studied and evaluated: 

1. Disk bandwidth. Several processes executing simultaneous read/write opera- 
tions. 

2. Disk answer time. Several video streams executing read operations
concurrently with best-effort requests.

3. Single server performance. Several video clients accessing MPEG videos
through a 100 Mb/s Ethernet.

4. Parallel server performance. Several video clients accessing MPEG videos
stored on several servers through a high bandwidth bus.

These experiments were executed using several popular disk scheduling
policies (FIFO, SCAN, CSCAN, EDF, and SCAN-EDF) to compare them with
our 2-Q algorithm.

Figure 4 shows the aggregate bandwidth for a single disk using several
scheduling policies. Each process of the experiment executes a multimedia
workload including deterministic and best-effort requests. The results show
that the bandwidth is higher for 2-Q than for the other algorithms when three
or more processes are executed. Specifically, the results of 2-Q are better
than those of CSCAN, which is typically used in disk drivers. These results
can be explained because the deterministic requests are prioritized over the
best-effort ones when using our scheduler. As a result, more contiguous I/O
requests are sent to the disk, and the seek time is reduced. With two
processes requesting I/O, 2-Q performance is similar to that of the other
algorithms. With only one process, the results are worse for 2-Q due to the
overhead of the protocol that inserts the requests in both queues. Future
versions of the algorithm will be optimized to solve this problem.




Fig. 5. Response Time for Best Effort Requests. 



Figure 5 shows the result of an experiment running a fixed number of
best-effort requests and an increasing number of periodic streams. As shown in
the figure, our scheduler has a better response time for best-effort requests
than EDF and SCAN-EDF, which are typically used in continuous media systems.
These results can be explained by the opportunistic behavior of our disk
driver scheduler, which serves best-effort requests in CSCAN order whenever it
has some free time in a round. This not only reduces the answer time, but also
minimizes the average seek time for best-effort requests. Those features are
not present in EDF or SCAN-EDF.

To test the performance of a single server, we used several video clients
accessing MPEG videos through a 100 Mb/s Ethernet. All the videos were stored
on a single server (the Pentium machine). Several priority levels were used in
this test. Figure 6 shows that the deterministic clients maintain their number
of frames per second, while the performance of the best-effort clients
degrades as the number of video clients increases.




Fig. 6. Influence of priority on the number of frames per second 



To evaluate the behavior of our parallel disk server scheduler, an increasing
workload was applied to a parallel disk server whose number of disk servers
was varied from 1 to 16. This experiment was executed on the SGO machine. We
wanted to measure the maximum number of statistic (periodic) streams served
before a deadline miss occurred. Figure 7 shows the results of test 4. The
workload was composed of deterministic sporadic streams, statistic periodic
streams, and best-effort requests. As can be seen, the 2-Q algorithm always
provides the same or better results than the others. This means that 2-Q can
serve more statistic clients before a deterministic deadline is missed.

5 Summary and Concluding Remarks 

In this paper, we presented a solution for the scheduling problem in a
parallel integrated storage platform which can satisfy the requirements of
real-time applications, multimedia systems, and traditional best-effort
applications. First, we motivated the need for such a solution, then presented
the architecture of our 2-Q scheduler. 2-Q has a hierarchical two-level
architecture where the upper level, or server level, is divided into three
queues for deterministic, statistic, and best-effort requests, each one using
a scheduling algorithm specific to that server. The solution proposed for one
disk server was generalized to a parallel disk server by using a
meta-scheduler to control the achievement of the deadlines of a parallel
stream.

Performance evaluations, made on a Pentium biprocessor and an SGO machine,
demonstrate that our scheduling architecture is adequate for handling stream
sets with different deterministic, statistic, or best-effort requirements.
Moreover, it maximizes the bandwidth of the disk, while minimizing the average
answer time for best-effort requests. The results of the evaluation of the
parallel disk scheduling architecture demonstrate that satisfying the
deterministic requests does not diminish the scalability of the solution when
several disks are used.



Abstract. This article is devoted to the redistribution of one-dimensional arrays
that are distributed in a GEN_BLOCK fashion over a processor grid. While 
GEN_BLOCK redistribution is essential for load balancing, prior research 
about redistribution has been focused on block-cyclic redistribution. The 
proposed scheduling algorithm exploits a spatial locality in message passing 
from a seemingly irregular array redistribution. The algorithm attempts to 
obtain near optimal scheduling by trying to minimize communication step size 
and the number of steps. According to experiments on CRAY T3E and IBM 
SP2, the algorithm shows good performance in typical distributed memory 
machines. 



1. Introduction 

The data parallel programming model has become a widely accepted paradigm for
programming distributed memory multicomputers. Appropriate data distribution is
critical for efficient execution of a data parallel program on a distributed memory
multicomputer. Data distribution deals with how data arrays should be distributed to
each processor. The array distribution patterns supported in High Performance Fortran
(HPF) are classified into two categories - basic and extensible. Basic distributions like
BLOCK, CYCLIC, and CYCLIC(n) are useful for an application that shows regular
data access patterns. Extensible distributions such as GEN_BLOCK and INDIRECT
are provided for load balancing or irregular problems [1].

The array redistribution problem has recently received considerable attention. This
interest is motivated largely by the HPF programming style, in which scientific
applications are decomposed into phases. At each phase, there is an optimal
distribution of the arrays onto the processor grid. Because the optimal distribution
changes from phase to phase, array redistribution turns out to be a critical
operation [2][11][13].

Generally, the redistribution problem comprises the following two subproblems
[7][11][13]:

• Message generation: The array to be redistributed should be efficiently scanned or 
processed in order to build all the messages that are to be exchanged between 
processors. 

• Communication schedule: All the messages must be efficiently scheduled so as to 
minimize communication overhead. Each processor typically has several messages 
to send to all other processors. 

This paper describes efficient and practical algorithms for redistributing arrays
between different GEN_BLOCK distributions. The "generalized" block distribution,
GEN_BLOCK, which is supported in High Performance Fortran version 2, allows
contiguous segments of an array, of possibly unequal sizes, to be mapped onto
processors [1], and is therefore useful for solving load balancing problems [14]. The
sizes of the segments are specified by values of a user-defined integer mapping array,
with one value per target processor of the mapping.
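
As an illustration of the mapping array, the Python sketch below turns a list
of segment sizes into index ranges and derives the messages needed to move
from one GEN_BLOCK distribution to another. The function names and the
(source, destination, size) message tuples are hypothetical conventions chosen
for this example. Pairs with the same source and destination denote data that
stays on its processor.

```python
from itertools import accumulate


def block_bounds(sizes):
    """Half-open index range [start, end) owned by each processor."""
    ends = list(accumulate(sizes))
    starts = [0] + ends[:-1]
    return list(zip(starts, ends))


def redistribution_messages(src_sizes, dst_sizes):
    """List (src, dst, size) for every overlap between source and target blocks."""
    if sum(src_sizes) != sum(dst_sizes):
        raise ValueError("both mapping arrays must cover the same array length")
    messages = []
    for p, (s_lo, s_hi) in enumerate(block_bounds(src_sizes)):
        for q, (d_lo, d_hi) in enumerate(block_bounds(dst_sizes)):
            lo, hi = max(s_lo, d_lo), min(s_hi, d_hi)
            if lo < hi:  # the two segments share elements lo .. hi - 1
                messages.append((p, q, hi - lo))
    return messages


print(block_bounds([2, 6, 4]))
# [(0, 2), (2, 8), (8, 12)]
print(redistribution_messages([4, 4, 4], [2, 6, 4]))
# [(0, 0, 2), (0, 1, 2), (1, 1, 4), (2, 2, 4)]
```

Because every segment is contiguous, each pair of processors exchanges at most
one message, and a processor only communicates with neighbors whose ranges
overlap its own.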

Since the address calculation and message generation processes in GEN_BLOCK
redistribution are relatively simple, this paper focuses on an efficient communication
schedule. The simplest approach to designing a communication schedule is to use
nonblocking communication primitives. The nonblocking algorithm, which is a
communication algorithm using nonblocking communication primitives, avoids
excessive synchronization overhead, and is therefore faster than the blocking
algorithm, which is a communication algorithm using blocking communication
primitives. However, the main drawback of the nonblocking algorithm is its need for
as much buffering as the data being redistributed [6][11][12].

There is a significant amount of research on regular redistribution - redistribution
between CYCLIC(n) and CYCLIC(m). Message passing in regular redistribution
involves cyclic repetitions of a basic message passing pattern, while GEN_BLOCK
redistribution does not show such regularity. There is no repetition, and there is only
one message passing between two processors. There is presently no research
concerning GEN_BLOCK redistribution. This paper presents a scheduling algorithm
for GEN_BLOCK redistribution using blocking communication primitives and
compares it with the nonblocking one.

The paper is organized as follows. In Section 2, the characteristics of blocking and
nonblocking message passing are presented. Some concepts about and definitions of
GEN_BLOCK redistribution are introduced in Section 3. In Section 4, we present an
optimal scheduling algorithm. In Section 5, a performance evaluation is conducted.
The conclusion is given in Section 6.

2. Communication Model 

The interprocessor communication overhead is incurred when data is exchanged
between processors of a coarse-grained parallel computer. The communication
overheads can be represented using an analytical model of typical distributed memory
machines, the General purpose Distributed Memory (GDM) model [7]. Similar
models are reported in the literature [2], [5], [6] and [11]. The GDM model represents
the communication time of a message passing operation using two parameters: the
start-up time $t_s$ and the unit data transmission time $t_m$.

The start-up time is incurred once for each communication event. It is independent
of the message size to be communicated. This start-up time consists of the transfer
request and acknowledgement latencies, context switch latency, and latencies for
initializing the message header. The data transmission time for a message is
proportional to the message size. Thus, the total communication time for sending a
message of size m units from one processor to another is modeled as Expression 1. A
permutation of the data elements among the processors, in which each processor has
at most m units of data for another processor, can be performed concurrently in the
Expression 1 time. The communication time of collective message passing delivered
in multiple communication steps is the sum of the communication time of each step as
shown in Expression 2.

    $T_{STEP} = t_s + m \times t_m$        (1)

    $T_{TOTAL} = \sum T_{STEP}$            (2)



Since the model does not allow node contention at the source and destination of 
messages, a processor can receive a message from only one other processor in every 
communication step. Similarly, a processor can send a message to only one other 
processor in each communication step. Contention at switches within the 
interconnection network is not considered. The interconnection network is modeled as 
a completely connected graph. Hence, the model does not include parameters for 
representing network topology. This assumption is realistic because of architectural 
features such as virtual channels and cut-through routing in state-of-the-art 
interconnection networks. Also, the component of the communication cost that is 
topology dependent is insignificant compared to the large software overheads 
included in message passing time. 
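
The cost model of this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `t_s` (start-up time) and `t_m` (unit data transmission time) and the numeric values are hypothetical, and each step is represented simply by the list of message sizes sent concurrently in it.

```python
def step_time(max_msg_size, t_s, t_m):
    """Expression 1: cost of one step whose largest message has max_msg_size units."""
    return t_s + max_msg_size * t_m


def total_time(steps, t_s, t_m):
    """Expression 2: sum of the step costs.

    steps is a list of steps; each step is a non-empty list of message sizes
    that are sent concurrently, so only the largest one determines the cost.
    """
    return sum(step_time(max(step), t_s, t_m) for step in steps)


# Three steps; the start-up time is paid once per step.
schedule = [[40, 25, 10], [30, 5], [8]]
print(total_time(schedule, t_s=100.0, t_m=0.5))  # 339.0
```

With these made-up parameters the three start-ups (300) dominate the transmission part (39), which matches the argument that the number of steps matters most when software overheads are large.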



3. GEN_BLOCK Redistribution 

This section illustrates the characteristics of GEN_BLOCK redistribution and 
introduces several concepts. GEN_BLOCK, which allows each processor to have 
different sizes of data blocks, is useful for load balancing [1]. Since, for dynamic load 
balancing, the current distribution may no longer be effective at a later time, 
redistribution between different GEN_BLOCKs, as in Fig. 1(a), is necessary. 

The GEN_BLOCK redistribution causes collective message passing, as can be 
seen in Fig. 1(b). Unlike regular redistribution, which has a cyclic message passing 
pattern, the message passing occurs in a local area. Suppose there are four processors: 
P_i, P_{l-1}, P_l, and P_{l+1}. If P_i has messages for P_{l-1} and P_{l+1}, there should be a 
message from P_i to P_l. In the same way, if there are messages from P_{l-1} to P_i and 
from P_{l+1} to P_i, there should be a message from P_l to P_i. For example, in Fig. 1(b), 
P2 sends messages to P2, P3, and P4, and P7 receives messages from P5, P6, and P7. We call 
this "Spatial Locality of Message Passing." 
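
The message pattern of a GEN_BLOCK redistribution can be enumerated by intersecting the index ranges of the source and target blocks. The sketch below is an illustration only: the function names and the block sizes are hypothetical and are not taken from Fig. 1. It shows that each sender reaches a contiguous run of receivers, which is the locality described above.

```python
from itertools import accumulate


def block_bounds(sizes):
    """Turn GEN_BLOCK sizes into half-open index ranges [lo, hi), one per processor."""
    ends = list(accumulate(sizes))
    starts = [0] + ends[:-1]
    return list(zip(starts, ends))


def redistribution_messages(src_sizes, dst_sizes):
    """Return (sender, receiver, size) for every non-empty overlap of a source
    block with a target block. Local copies (sender == receiver) are included."""
    if sum(src_sizes) != sum(dst_sizes):
        raise ValueError("both distributions must cover the same array")
    src, dst = block_bounds(src_sizes), block_bounds(dst_sizes)
    msgs = []
    i = j = 0
    while i < len(src) and j < len(dst):
        lo = max(src[i][0], dst[j][0])
        hi = min(src[i][1], dst[j][1])
        if hi > lo:
            msgs.append((i, j, hi - lo))
        # Advance whichever block ends first.
        if src[i][1] <= dst[j][1]:
            i += 1
        else:
            j += 1
    return msgs


msgs = redistribution_messages([4, 4, 4, 4], [2, 7, 3, 4])
print(msgs)
# [(0, 0, 2), (0, 1, 2), (1, 1, 4), (2, 1, 1), (2, 2, 3), (3, 3, 4)]
```

In this example processor 1 receives from processors 0, 1, and 2, a contiguous range of senders, and no pair of processors exchanges more than one message.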

Messages with the same sender (or receiver) cannot be sent in the same communication 
step. In this paper, we call a set of those messages a "Neighbor Message Set." One 
characteristic of a neighbor message set is that it contains at most two link messages, 
where a "Link Message" is a message that is included in two neighbor message sets, as 
shown in Fig. 1(c). Another characteristic is that the chain of neighbor message sets is not 
cyclical, which means that there is an order between them. We define a "Schedule" as a 
set of communication steps. In this paper, we assume that the communication steps 
are ordered by their communication time. Fig. 1(d) is an example of a schedule. The 
schedule completes redistribution with three communication steps. 

There may be many schedules for a redistribution. Variations of a schedule are 
easily generated by changing the communication steps of a message. For example, by 
moving m7 to Step 1, we can make a new schedule. However, m6 is already in Step 1, 
and since m6 and m7 are in the same neighbor set, m6 must be moved to a different 
step. If m8 moves from Step 3 to Step 2, then m2, m4, and m5 as well as m7 and m3 
have to be relocated. We define two message sets that cannot be located in the 
same communication step as a "Conflict Tuple." For example, between Step 2 and 
Step 3, there are two conflict tuples: ({m3, m7}, {m5, m8}) and ({m14}, {m15}). Fig. 1(e) 
shows all conflict tuples of the schedule in Fig. 1(d). 

4. Relocation Scheduling Algorithm 

To save redistribution time, it is necessary to reduce the number and the size of 
communication steps. In this section, we propose two optimal conditions and an 
optimal scheduling algorithm for GEN_BLOCK redistribution. 



4.1 Minimal Step Condition 

According to Expressions 1 and 2, the influence of the message shuffling pattern or of 
non-maximum message sizes in a communication step is not crucial, but the 
communication time increases linearly as the number of communication steps 
increases. Therefore, reducing the number of communication steps is important. In this 
section, we propose a minimal step condition for GEN_BLOCK redistribution. 

• Minimal Step Condition: The minimum number of communication steps is the 
maximum cardinality of the neighbor message sets. 

Since the messages in a neighbor message set cannot be sent in the same communication 
step, there must be at least as many communication steps as the maximum cardinality of 
the neighbor message sets. According to the characteristics of neighbor message sets, 
there are at most two link messages in a neighbor message set. All the link messages 
can be located within two communication steps by using a zigzag locating mechanism, 
as with m1, m2, m3, and m4 in Fig. 1(d). The remaining messages in the neighbor sets 
are then all independent and can be located in any communication step. Therefore, 
the minimum number of communication steps is the maximum cardinality of the 
neighbor message sets. 
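
The bound stated by the minimal step condition is easy to compute. The following sketch is an illustration with hypothetical names and data: it groups messages by sender and by receiver (the neighbor message sets) and returns the largest group size. It assumes that the caller passes only messages between distinct processors; whether local copies count as messages is not stated in the text.

```python
from collections import Counter


def minimal_steps(messages):
    """Maximum cardinality over all neighbor message sets.

    messages is a list of (sender, receiver, size) tuples. Messages that share
    a sender, or share a receiver, form one neighbor message set.
    """
    per_sender = Counter(sender for sender, _, _ in messages)
    per_receiver = Counter(receiver for _, receiver, _ in messages)
    cardinalities = list(per_sender.values()) + list(per_receiver.values())
    return max(cardinalities, default=0)


# Processor 1 receives three messages, so no schedule can use fewer than 3 steps.
example = [(0, 1, 2), (2, 1, 1), (3, 1, 4), (3, 4, 2)]
print(minimal_steps(example))  # 3
```

The function only computes the lower bound; constructing a schedule that actually meets it is the job of the scheduling algorithm.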



4.2 Minimal Size Condition 

As shown in Expression 1, reducing the size of the maximum message in each step is 
important for efficient redistribution. In this section, we propose another optimal 
scheduling condition for GEN_BLOCK redistribution. 

• Minimal Size Condition: For all conflict tuples (M, M') in schedule C, $SZ_M \ge SZ_{M'}$. 

$SZ_N$ is the size of N, where N can be a message, a communication step, or a schedule. 
If N is a message, $SZ_N$ is the size of the message. If N is a communication step, $SZ_N$ 
is the size of the maximum message in N. If N is a schedule, $SZ_N$ is the sum of the 
sizes of the communication steps in N. 

Suppose a communication schedule C1 has a conflict tuple that does not hold the 
minimal size condition, which means that there is a conflict tuple (M, M') between 
communication steps S_i and S_j (i > j) such that $SZ_M < SZ_{M'}$. Let's consider another 
schedule, C2, that is the same as C1 except for the message sets M and M'. In C2, M is 
in S_j and M' is in S_i. To understand this easily, we express S_i and S_j in C1 as S1_i and 
S1_j, and S_i and S_j in C2 as S2_i and S2_j. Because of the definitions of communication 
step and schedule, and Expressions 1 and 2, $SZ_{S1_i} \ge SZ_M$ and $SZ_{S1_j} \ge SZ_{M'}$. If 
$SZ_{S1_i} > SZ_M$ and $SZ_{S1_j} = SZ_{M'}$, then $SZ_{S2_i} \ge SZ_{M'}$, therefore $SZ_{C1} = SZ_{C2}$. In 
case $SZ_{S1_i} > SZ_M$ and $SZ_{S1_j} > SZ_{M'}$, if there is a message m in S2_j such that $SZ_m$ is 
less than $SZ_{M'}$ but greater than any other messages in S2_j, then 
$SZ_{C2} = SZ_{C1} - (SZ_{M'} - SZ_m)$, otherwise $SZ_{C2} = SZ_{C1} - (SZ_{M'} - SZ_M)$. In both cases, 
$SZ_{C2} < SZ_{C1}$. Therefore, if there is a schedule C1 that does 
not hold the minimal size condition, then we can always make a schedule C2 that is 
not worse than C1 by changing the communication steps of M and M'. 
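
The size definitions and the effect of exchanging two conflicting messages can be checked numerically. The sketch below is a toy illustration: the schedules and message sizes are invented, and it is simply assumed that the messages of size 2 and 7 form a conflict tuple whose steps can be swapped without creating new conflicts.

```python
def sz_step(step):
    """Size of a communication step: the size of its largest message."""
    return max(step, default=0)


def sz_schedule(schedule):
    """Size of a schedule: the sum of the sizes of its communication steps."""
    return sum(sz_step(step) for step in schedule)


# Each step is a list of message sizes.
c1 = [[9, 2], [4, 7], [3]]   # the smaller message (2) sits in the earlier step
c2 = [[9, 7], [4, 2], [3]]   # same schedule with the two messages exchanged

print(sz_schedule(c1), sz_schedule(c2))  # 19 16
```

Here the exchange lowers the redistribution size from 19 to 16 because the large message now shares a step with an even larger one.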

Unfortunately, this proof does not guarantee that all schedules that satisfy the 
minimal size condition are optimal. However, the experiments in Section 5 show that 
a schedule that satisfies the condition performs better. 



4.3 Relocation Scheduling Algorithm 

Algorithm 1 is a scheduling algorithm that generates an optimal communication 
schedule. The schedule satisfies the minimal step condition and the minimal size condition. 

The algorithm consists of two phases. In the first phase, it performs size-oriented 
list scheduling [15] (lines 6-12). During the first phase, when it cannot 
allocate a message without violating the minimal step condition, it goes to the next 
phase - the relocation phase. In the relocation phase, it relocates already allocated 
messages and makes space for the new message. 

Some redistributions can be scheduled through only the list scheduling phase. In 
this case, because a larger message is located in an earlier step, the minimal size 
condition is satisfied. The schedule in Fig. 1(d) is generated through only the list 
scheduling phase. The redistribution in Fig. 2(a) is an example that cannot be 
scheduled through list scheduling alone. Because the maximum cardinality of its 
neighbor sets is 2, the messages have to be scheduled within two communication 
steps. 



Algorithm: locate 

input   M  : Message Set sorted by size and reservation 
        rn : reserved neighbor set 
output  S  : Schedule 

{ 
 1   Sort(M); 
 2   SNT(1,rn) = 1; 
 3   for (i=0; i<card(M); i++) { 
 4     m = M(i); 
 5     for (j=0; j<card(S); j++) { 
 6       s = S(j); 
 7       if ((SNT(s,nb(m,1))==0) && (SNT(s,nb(m,2))==0)) { 
 8         SNT(s,nb(m,1)) = 1; 
 9         SNT(s,nb(m,2)) = 1; 
10         Insert(m, s); 
11         go to next; 
12     } } 
13     LS = replace(S, locate(LM, nb(m,1))); 
14     RS = replace(S, locate(RM, nb(m,2))); 
15     if (sz(LS) < sz(RS)) S = LS; 
16     else S = RS; 
17   next: 
18   } 
19   return S; 
} 

- Sort(M) sorts the messages in M by size. 
- SNT(s,n) is a flag indicating that step s already contains a message of the n-th 
  neighbor set. 
- card(S) is the cardinality of set S. 
- nb(m,i) is the i-th neighbor set of message m. 
- Insert(m,s) inserts message m into step s. 
- replace(S,S') returns a new schedule in which the positions of the messages in S are 
  replaced by those in S'. 
- sz(S) is the redistribution size of schedule S. 



Algorithm 1. Relocation Scheduling Algorithm 
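
The list scheduling phase on its own can be sketched as follows. This is a simplified illustration and not the full Algorithm 1: it places the largest message first into the earliest step in which neither its sender nor its receiver is already busy, and it opens a new step when no step fits, whereas the relocation phase of Algorithm 1 would instead rearrange earlier messages to stay within the minimal number of steps. All names and the example messages are hypothetical.

```python
def list_schedule(messages):
    """Size-oriented list scheduling without relocation.

    messages is a list of (sender, receiver, size) tuples. Returns a list of
    steps; each step is a list of messages that can be sent concurrently.
    """
    steps = []       # messages placed in each step
    busy_send = []   # senders already used in each step
    busy_recv = []   # receivers already used in each step
    for msg in sorted(messages, key=lambda m: m[2], reverse=True):
        sender, receiver, _ = msg
        for k in range(len(steps)):
            if sender not in busy_send[k] and receiver not in busy_recv[k]:
                break
        else:
            # No existing step is conflict-free: open a new one.
            steps.append([])
            busy_send.append(set())
            busy_recv.append(set())
            k = len(steps) - 1
        steps[k].append(msg)
        busy_send[k].add(sender)
        busy_recv[k].add(receiver)
    return steps


example = [(0, 1, 5), (2, 1, 3), (2, 3, 4), (4, 3, 1)]
for number, step in enumerate(list_schedule(example), start=1):
    print("Step", number, step)
```

For this input the sketch produces two steps, which equals the maximum neighbor set cardinality, so no relocation would be needed.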

The input message set M of the relocation scheduling algorithm is {m1, m6, m3, m4, 
m7, m5, m2, m8}. The algorithm schedules m1, m6, m3, m4, and m7 with list scheduling, 
but because m6 is already in Step 1 and m4 is already in Step 2, m5 cannot be located in 
either Step 1 or Step 2. Control is then passed to the relocation phase. Because 
message m5 is included in neighbor sets N4 and N5, it has to be placed in one of the two 
marked positions. In the relocation process, the schedule is divided into two sub message 
sets. One, which we term the left set, is composed of the messages before N4, and the 
other, termed the right set, is composed of the messages after N5. The left set and the 
right set are themselves GEN_BLOCK redistributions and can be passed recursively as 
input to the relocation scheduling algorithm. In the case of the left set, to make space in 
Step 2 for m5, m4 is reserved in Step 1 before the recursive call. When the recursive call 
returns, we get the new subschedule in Fig. 2(c), merge it with the original schedule, and 
obtain the new schedule in Fig. 2(e). In the case of the right set, we also make another 
schedule, shown in Fig. 2(f), but we discard it because it does not satisfy the minimal 
size condition. 

5. Performance Evaluation and Experimental Results 

This section shows timing results from implementations of our redistribution 
algorithm on CRAY T3E and IBM SP2. The algorithms and node programs were 
coded in Fortran 77, and MPI primitives were used for interprocessor communication. 
We use a random function to generate a series of GEN_BLOCK distributions for the 
experiment. 

To evaluate the relocation scheduling algorithm in various redistribution 
environments, we have assumed three changing load situations: stable, moderate, and 
unstable. In the stable case, the changes are not heavy. Therefore, there are a small 
number of messages, and the average size of the messages is also relatively small. To 
make the series of distributions for this case, the standard deviation of the block sizes is 
limited to less than 10 % of the average block size. As the situation becomes moderate 
and unstable, the changing loads in each processor become heavy. Thus, for the 
moderate and unstable situations, we assume that the standard deviation of the block 
sizes is between 45 % and 55 % and between 90 % and 100 % of the average block size, 
respectively. 
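
One way to produce distributions that fall into these three bands is rejection sampling. The sketch below is purely an assumption about how such input could be generated: the text only says that a random function is used, so the gamma distribution, the function name, and all parameters here are hypothetical choices for illustration.

```python
import random
import statistics


def random_gen_block(n_procs, total, cv_lo, cv_hi, rng, max_tries=100000):
    """Draw block sizes that sum to `total` and whose standard deviation,
    relative to the average block size, lies in [cv_lo, cv_hi]."""
    if cv_hi <= 0:
        raise ValueError("cv_hi must be positive")
    target = (cv_lo + cv_hi) / 2
    shape = 1.0 / target ** 2            # a gamma variate with this shape has std/mean = target
    for _ in range(max_tries):
        raw = [rng.gammavariate(shape, 1.0) for _ in range(n_procs)]
        scale = total / sum(raw)
        sizes = [int(x * scale) for x in raw]
        sizes[-1] += total - sum(sizes)  # absorb rounding so the blocks cover the array
        cv = statistics.pstdev(sizes) / statistics.mean(sizes)
        if cv_lo <= cv <= cv_hi:
            return sizes
    raise RuntimeError("no distribution found within max_tries")


rng = random.Random(0)
stable = random_gen_block(16, 1 << 20, 0.00, 0.10, rng)
moderate = random_gen_block(16, 1 << 20, 0.45, 0.55, rng)
unstable = random_gen_block(16, 1 << 20, 0.90, 1.00, rng)
print(stable, moderate, unstable, sep="\n")
```

Consecutive draws from one band can then serve as the source and target distributions of a redistribution.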

In order to evaluate the effects of the minimal size condition and the minimal step 
condition, we use the following five scheduling algorithms. 

• NBLK is a schedule for a redistribution that uses non-blocking communication 
primitives. 

• OPT is an optimal schedule for blocking communication that satisfies the minimal 
step condition and the minimal size condition. The schedule is generated by the 
relocation scheduling algorithm. 

• MIN_STEP is a minimal step schedule using blocking communication primitives. 
For this schedule, we use a random list schedule, and to keep the minimal step 
condition, we use the relocation phase of the relocation algorithm. 

• MIN_SIZE is a minimal size schedule using blocking communication primitives. 
For this schedule, we use the size-oriented list schedule. 

• RANDOM is a random schedule using blocking communication primitives. For 
this schedule, we simply use a random list schedule. Therefore, it guarantees 
neither the minimal step condition nor the minimal size condition. 

Fig. 3. Redistribution times on CRAY T3E 

Fig. 4. The effects of the number of processors and the array size on CRAY T3E 

5.1. Results from Cray T3E 

Fig. 3 shows the performance differences among the algorithms on Cray 
T3E. Here, the performance of the NBLK, OPT, MIN_STEP, MIN_SIZE, and 
RANDOM schedules is measured on 16 processors. The size of the array used is 1M 
floating-point elements. From Fig. 3(a), as expected, the NBLK schedule shows the best 
performance. Among the schedules using blocking primitives, it can be seen that the 
OPT schedule takes the least amount of redistribution time, whereas the RANDOM 
schedule takes the longest. The redistribution times for the MIN_SIZE schedule and 
the MIN_STEP schedule vary. 

Fig. 3(b) shows the redistribution size of each algorithm. The redistribution size is the 
sum of the maximum message sizes of each step. According to Fig. 3(b), as the 
changing load situation becomes worse, the redistribution sizes increase. In the graph, 
we can see that the gradients of the MIN_STEP schedule and the RANDOM 
schedule are steeper than those of the OPT schedule and the MIN_SIZE schedule. 
Because of the influence of the redistribution size, the graph in Fig. 3(a) shows that 
the redistribution times of the MIN_STEP schedule and the RANDOM schedule grow 
more quickly than those of the OPT schedule and the MIN_SIZE schedule. 

The influence of the number of communication steps can be inferred from the gap 
between the redistribution times of the OPT schedule and the MIN_SIZE schedule. 
According to Expressions 1 and 2, there are two factors that determine the redistribution 
time: the number of communication steps, and the redistribution size. As shown in 
Fig. 3(b), the redistribution size of the two schedules is almost the same. Hence, the 
difference in redistribution time is caused by the small difference in the number of 
communication steps. 

The trends in Fig. 3 also appear in experiments performed with different numbers of 
processors and different array sizes. In Fig. 4(a), the array size varies from 64K to 4M, 
and in Fig. 4(b), the number of processors varies from 4 to 64. From these graphs, we 
observe that, among the redistribution schedules for blocking communication 
primitives, the OPT schedule achieves better performance than the other schedules. 
The results presented in this section show that for GEN_BLOCK redistribution, it is 
essential to satisfy the minimal step condition and the minimal size condition. 



5.2. Results from IBM SP2 

The results are slightly different when the experiment is performed on IBM SP2. In 
this section, we present and analyze the results from IBM SP2. 

According to Fig. 5, OPT shows a speed similar to that of MIN_SIZE, and the MIN_STEP 
schedule is even worse than the RANDOM schedule. This means that, on IBM SP2, the 
Minimal Step Condition is not helpful for redistribution, and sometimes it is harmful. 

This is because IBM SP2 does not satisfy the assumed communication model, GDM. 
As stated in Section 2, the interconnection network in GDM is a completely connected 
graph, and contention at switches in the interconnection network is not considered. To 
examine whether IBM SP2 and CRAY T3E satisfy the GDM model, we perform the 
experiment shown in Fig. 6. The experiment checks the communication time of different 
shuffling situations: one-directional message passing between two processors (1D), 
bi-directional message passing with two messages between two processors (2D), 
shift-style message passing with three messages between four processors (SH), and 
cyclic-shift-style message passing with four messages between four processors (C-SH). 
In case 2D, we changed one message to be 1/4, 2/4, 3/4, and 4/4 of the size of the 
other. In this experiment, CRAY T3E shows relatively fixed performance, but IBM 
SP2 takes various communication times according to the number of messages and 
the message size. SP2 takes more time in 2D, SH, and C-SH than in 1D, and it also takes 
more time as the message size increases in the variation of 2D. This means that 
Expression 1 is no longer valid on SP2, and increasing the degree of parallelism is not 
always good. This might be due to the architectural difference between the two 
systems. The interconnection network of CRAY T3E is a 3D torus and that of IBM 
SP2 is a multistage interconnection network (MIN); it is generally known that the 
probability of contention in a MIN is higher than in a mesh or torus structure such as 
that of CRAY T3E [16][17]. 

Fig. 5. Redistribution times on IBM SP2 

These experiments demonstrate that the proposed scheduling algorithm shows 
good performance when the communication model is close to the GDM model. 



6. Related Work 

Many methods for performing array redistribution have been presented in the 
literature. 

Kalns et al. proposed a processor mapping technique to minimize the amount of 
data exchange for BLOCK-to-CYCLIC(n) redistribution. Using the data to logical 
processors mapping, they showed that the technique can achieve the maximum ratio 
between data retained locally and the total amount of data exchanged [9]. 

Fig. 6. Communication time of message shufflings on (a) CRAY T3E and (b) IBM SP2 



Gupta et al. derived closed form expressions for determining the send/receive 
processor/data sets for the BLOCK-to-CYCLIC redistribution. They also provided a 
virtual processor approach to address the problem of reference index-sets 
identification for array statements with CYCLIC(n) distribution and formulated active 
processor sets as closed forms [10]. 

Thakur et al. proposed runtime array redistribution algorithms for 
BLOCK-to-CYCLIC(m) and CYCLIC(n)-to-CYCLIC(kn). Based on these 
algorithms, they proposed a two-phase redistribution method for 
CYCLIC(m)-to-CYCLIC(n) using the LCM or GCD of m and n [4]. In the same paper, 
Thakur et al. proposed some ideas that are generally accepted in later papers. One is a 
general redistribution mechanism, the LCM method, for CYCLIC(m)-to-CYCLIC(n) 
redistribution. Another is the effect of indirect communication, which was expanded to 
a multiphase redistribution mechanism by Kaushik et al. [5]. The last is an evaluation of 
nonblocking communication, which Thakur et al. presented as more efficient than 
blocking communication because the computation and communication are performed 
in parallel. Subsequent research has focused on CYCLIC(n)-to-CYCLIC(kn) 
redistribution using nonblocking communication primitives. 

Walker et al. posited that the performance of the redistribution algorithm using 
blocking communication primitives was comparable to that of nonblocking redistribution 
[6]. 

Lim et al. [7] proposed a generalized circulant matrix formalism to reduce the 
communication overheads for CYCLIC(n)-to-CYCLIC(kn) redistribution. Based on 
the generalized circulant matrix formalism, they proposed direct, indirect, and hybrid 
communication schedules. They demonstrated that an indirect communication 
schedule using blocking primitives shows better performance than redistribution using 
asynchronous communication primitives. 

7. Conclusion 

In this paper, we propose a new scheduling algorithm for redistribution between 
different GEN_BLOCKs. 

In spite of excessive synchronization overhead, many programs use blocking 
communication primitives because of the cost of communication buffers. However, to 
avoid deadlock and poor performance, the messages using blocking communication 
primitives have to be well scheduled. This paper analyzes the characteristics of 
communication primitives in MPI and argues that a Minimal Step Condition and a 
Minimal Size Condition are essential in blocking GEN_BLOCK redistribution. 
Moreover, by adding a relocation phase to list scheduling, we make an optimal 
scheduling algorithm: a relocation scheduling algorithm. 

In Section 5, various experiments on CRAY T3E and IBM SP2 were performed to 
evaluate the proposed algorithm and analyze the influences of the optimal conditions 
in a real environment. The experiments show that the relocation 
scheduling algorithm performs better and that the optimal conditions are 
critical in enhancing the communication speed of GEN_BLOCK redistribution under 
the GDM model. 


Abstract The SCOOPP (Scalable Object Oriented Parallel Programming) 
system is a hybrid compile and run-time system. SCOOPP dynamically scales 
OO applications on a wide range of target platforms, including a novel feature 
to perform a run-time packing of excess parallel tasks. This communication 
details the methodology and policies to pack parallel objects into grains and 
method calls into messages. The SCOOPP evaluation focuses on a pipelined 
parallel algorithm - the Eratosthenes sieve - which may dynamically generate a 
large number of fine-grained parallel tasks and messages. This case study shows 
how the parallelism grain-size - both computational and communication - has a 
strong impact on performance and on the programmer burden. The presented 
performance results show that the SCOOPP methodology is feasible and the 
proposed policies achieve efficient portability results across several target 
platforms. 



1 Introduction 

Most parallel applications require parallelism granularity decisions: a larger number of 
fine parallel tasks may help to scale up the parallel application and it may improve the 
load balancing. However, if parallel tasks are too fine, performance may degrade due 
to parallelism overheads, both at the computational and communication level. 

Static granularity control, performed at compile-time, can be efficiently applied to 
fine grained tasks [1][2], whose number and behaviour is known at compile-time. HPF 
[3], HPC++ [4], and Elbe [5] are examples of environments that support static 
granularity control. However, parallel applications where parallel tasks are 
dynamically created and whose granularity cannot be accurately estimated at 
compile-time require dynamic granularity control to get an acceptable performance; 
this also applies when portability is required across several platforms. 

Granularity control can lead to better performance when performed by the 
programmer, but it adds an extra burden on the programmer activity: it requires 
knowledge of both the architecture and the algorithm behaviour, and it also reduces 
the code clarity, reusability and portability. 

The SCOOPP system [6] is a hybrid compile and run-time system that extracts 
parallelism, supports explicit parallelism and dynamically serialises parallel tasks in 
excess at run-time, to dynamically scale applications through a wide range of target 
platforms. This paper evaluates the application of the SCOOPP methodology to 
dynamically scale a pipelined application - the Eratosthenes sieve - on three different 
generations of parallel systems: a 7 node Pentium II 350 MHz based cluster, running 
Linux with a threaded PVM on TCP/IP, a 16 node PowerPC 601 66 MHz based 
Parsytec PowerXplorer and a 56 node T805 30 MHz based Parsytec MultiCluster 3, 
both running PARIX with proprietary communication primitives, functionally 
identical to PVM. The cluster nodes are inter-connected through a 1 Gbit Myrinet 
switch, the PowerXplorer nodes use a 4x4 mesh of 10 Mbit Transputer-based 
connections and the MultiCluster Transputers are interconnected through a 7x8 mesh. 

Section 2 presents an overview of the SCOOPP system and its features to 
dynamically evaluate the parallelism granularity and to remove excess parallelism. 
Section 3 introduces the Eratosthenes sieve and presents the performance results. 
Section 4 concludes the paper and presents suggestions for future work. 



2 SCOOPP System Overview 

SCOOPP is based on an object oriented programming paradigm supporting both 
active and passive objects. Active objects are called parallel objects in SCOOPP 
(//obj) and they specify explicit parallelism. These objects model parallel tasks and 
may be placed at remote processing nodes. They communicate through either 
asynchronous or synchronous method calls. 

Passive objects are supported to take advantage of existing code. These objects are 
placed in the context of the parallel object that created them, and only copies of them 
are allowed to move between parallel objects. Method calls on these objects are 
always synchronous. 

Parallelism extraction is performed by transforming selected passive objects into 
parallel objects (more details in [7]), whereas parallelism serialisation (i.e. grain 
packing) is performed by transforming parallel objects into passive ones [8]. 

Granularity control in SCOOPP is accomplished in two steps. At compile-time the 
compiler and/or the programmer specifies a large number of fine-grained parallel 
objects. At run-time parallel objects are packed into larger grains - according to the 
application/target platform behaviour and based on security and performance issues - 
and method calls are packed into larger messages. 

Packing methodologies are concerned with “how” to pack and “which” items to pack; 
this subject is analysed in section 2.1. These methodologies rely on parameters, which 
are estimated to control granularity at run-time; these are analysed in section 2.2. 
Packing policies focus on “when” and “how much” to pack, and they heavily rely on 
the structure of the application; this subject is analysed in section 2.3. 

2.1 Run-Time Granularity Control 

Conventional approaches for run-time granularity control are based on fork/join 
parallelism [9][10][11][12][13]. The grain-size can be increased by ignoring the fork 
and executing tasks sequentially, avoiding spawning a new parallel activity to execute 
the forked task. 

The SCOOPP system dynamically controls granularity by packing several //obj 
into a single grain and serialising intra-grain operations. Additionally, SCOOPP can 
reduce inter-grains communication by packing several method calls into a single 
message. 



Packing Parallel Objects. The main goal of object packing is to decrease parallelism 
overheads by increasing the number of intra-grain operations between remote method 
calls. Intra-grain method calls - between objects within the same grain - are 
synchronous and usually performed directly as a normal procedure call; asynchronous 
inter-grain calls are implemented through standard inter-tasks communication 
mechanisms. 

The SCOOPP run-time system packs objects when the grain-size is too fine and/or 
when the system load is high. The SCOOPP system takes advantage of the availability 
of granularity information on existing //obj. When parallel tasks (e.g. //obj) are created 
at run-time, it uses this information to decide if a newly created //obj should be used to 
enlarge an existing grain (e.g. locally packed) or originate a new remote grain. 

Packing Method Calls. Method call packing in SCOOPP aims to reduce parallelism 
overheads by packing several method calls into a single message. 

The SCOOPP run-time system packs method calls when the grain-size is too fine. 
On each inter-grains method call, SCOOPP uses granularity information on existing 
objects to decide if the call generates a new message or if it is packed together with 
other method calls into a single message. 

Packing Parallel Objects and Method Calls. The two types of packing complement 
each other to increase the grain-size. They differ in two aspects: (i) method calls 
cannot be packed in all applications, since the packing relies on repeated method calls 
between two grains, and may lead to deadlock when calls are delayed for an arbitrarily 
long time; this delay arises from the need to fulfil the required number of calls per 
message; (ii) method calls in a message can be more easily unpacked than objects in a 
grain. Reversing object packing usually requires object migration, whereas packs of 
method calls can be sent without waiting for the message to be fully packed. In 
SCOOPP, packs of method calls are sent either on programmer request or when the 
source grain calls a different method on the same remote grain. 
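
The sending rules for packed method calls can be illustrated with a small buffer object. This is a hypothetical sketch, not SCOOPP code: the class name `CallPacker`, the `send` callback, and the `calls_per_message` threshold are all invented for the illustration. It buffers calls to one remote grain and sends the pack when it is full, when a different method is called, or on an explicit request.

```python
class CallPacker:
    """Buffers method calls for one remote grain and ships them as one message."""

    def __init__(self, send, calls_per_message):
        self.send = send                          # callable(method_name, list_of_args)
        self.calls_per_message = calls_per_message
        self.method = None
        self.pending = []

    def call(self, method, args):
        if self.pending and method != self.method:
            self.flush()                          # a different method forces the pack out
        self.method = method
        self.pending.append(args)
        if len(self.pending) >= self.calls_per_message:
            self.flush()                          # the pack is full

    def flush(self):
        """Send the current pack; also serves as the explicit programmer request."""
        if self.pending:
            self.send(self.method, self.pending)
            self.pending = []


sent = []
packer = CallPacker(lambda method, calls: sent.append((method, calls)), 3)
for number in (2, 3, 5, 7):
    packer.call("process", number)
packer.call("stop", ())
packer.flush()
print(sent)
# [('process', [2, 3, 5]), ('process', [7]), ('stop', [()])]
```

The explicit `flush` is what keeps a partially filled pack from being delayed indefinitely, which is the deadlock risk mentioned in aspect (i).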



2.2 Parameters Estimation 

To take the decision to pack, two sets of parameters are considered: those that are 
application independent and those that are application dependent. The former includes 
the latency of a remote "null-method" call (α) and the inter-node communication 
bandwidth. The latter includes the average overhead of the method parameters passing 
(ν), the average local method execution time (μ), the method fan-out (φ) (i.e., the 
average number of method calls performed on each object per method execution) and 
the number of grains per node (γ). 

Application independent parameters are statically evaluated by a kernel application, 
running prior to the application execution; parameters that depend on the application 
are dynamically evaluated during application execution. The next two subsections 
present more details of how these two types of parameters are estimated. 

Application Independent Parameters. Application independent parameters include 
the latency of a remote "null-method" call (α) and the inter-node communication 
bandwidth. Both parameters are defined for an “unloaded” target platform. They are 
estimated through a simple kernel SCOOPP application that creates several //obj on 
remote nodes and performs a method call on each object. 

The remote method call latency (α) is the time required to activate a method call on 
a remote //obj. It is estimated as half the time required to call and remotely execute a 
method that has no parameters and only returns a value. 

The inter-node communication bandwidth is estimated by measuring the time 
required to call a method with an arbitrarily large parameter size. It is half of the 
division of the parameter size by the time required to execute the method call. 

On some target platforms, these two parameters depend on the pair of 
source/destination nodes, namely on the interconnection topology. In such cases, 
SCOOPP computes the average of the parameters measured between all pairs of nodes. 
Moreover, these parameters tend to degrade when the target platform is highly loaded, 
due to network congestion and computational load. However, this effect is taken into 
account in the SCOOPP methodology through the y parameter (number of grains per 
node), which is a measure of the load on each node. 
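
The two static estimates can be sketched as follows. This is a minimal illustration 
rather than SCOOPP code: the function names and the dictionary of per-pair 
measurements are hypothetical, and the raw timings are assumed to be supplied by the 
kernel application. 

```python
def estimate_latency(null_call_time):
    """Latency (a): half the time of a remote call that has no parameters."""
    return null_call_time / 2.0


def estimate_bandwidth(param_size, call_time):
    """Bandwidth: half of the parameter size divided by the call time."""
    return 0.5 * param_size / call_time


def average_over_pairs(per_pair_values):
    """Average a parameter measured between all (source, destination) pairs."""
    return sum(per_pair_values.values()) / len(per_pair_values)
```

The averaging helper corresponds to the case where the measurements depend on the 
interconnection topology. 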

These two parameters are statically estimated to reduce congestion penalties at 
run-time, since they require inter-node communication, which is one of the main 
sources of parallelism overheads. Their evaluation at run-time, during application 
execution, may introduce a significant performance penalty. 



Application Dependent Parameters. SCOOPP monitors granularity by computing, 
at run-time, the average overhead of the method parameters passing (v), the average 
method execution time (p) and the average method fan-out (φ). SCOOPP computes 
these parameters, at each object creation, from application data collected during 
run-time. 

The overhead of the method parameters passing (v) is computed by dividing the 
average method parameter size by the inter-node communication bandwidth. The 
average size is evaluated by recording the number of method calls and adding 
the parameter sizes of each method call. 
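
A minimal sketch of this bookkeeping is shown below. The class and method names are 
hypothetical, and the sketch assumes that the bandwidth is expressed in bytes per 
second, so that the overhead is the average parameter size divided by the bandwidth. 

```python
class ParameterPassingEstimator:
    """Tracks parameter sizes to estimate the passing overhead (v)."""

    def __init__(self, bandwidth):
        self.bandwidth = bandwidth  # assumed unit: bytes per second
        self.calls = 0              # number of recorded method calls
        self.total_size = 0         # sum of the parameter sizes, in bytes

    def record_call(self, param_size):
        self.calls += 1
        self.total_size += param_size

    def average_size(self):
        return self.total_size / self.calls if self.calls else 0.0

    def overhead(self):
        """Estimated time (in seconds) to pass the average parameters."""
        return self.average_size() / self.bandwidth
```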

The average method execution time (p) is evaluated by recording the time required 
to perform each local method execution. When a method does not perform other calls, 
this value is just the elapsed time. When a method contains other calls, the 
measurement is split into the pre-call and after-call phases, and the previous procedure 
is applied to each phase. Moreover, the time required to perform the pre-call phase is 
used as a first estimate of the average method execution time, so that average 
method execution time data is available for the next method call, even if the first 
method execution has not completed yet. 

The average method fan-out (φ) is measured by a global program analysis through 
object and method call statistics. The run-time system marks each //obj with its depth 
in the object creation tree. The depth of the root object is one and the depth of every 
other //obj is equal to the depth of its creator plus one. The run-time system maintains 
a table with the number of calls performed at each depth, which is incremented on each 
local method call. The method fan-out is derived from this table through the overall 
ratio between consecutive depths. 
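
The depth table can be sketched as follows. The names are hypothetical, and the 
"overall ratio between consecutive depths" is interpreted here, as an assumption, as 
the total number of calls at depths that have a parent depth divided by the total 
number of calls at depths that have a child depth. 

```python
from collections import defaultdict


class FanOutEstimator:
    """Per-depth call counters used to estimate the method fan-out."""

    def __init__(self):
        self.calls_at_depth = defaultdict(int)

    def record_call(self, depth):
        """Count one local method call on an object at the given depth."""
        self.calls_at_depth[depth] += 1

    def fan_out(self):
        table = self.calls_at_depth
        # Calls at depths d for which depth d + 1 was also observed.
        parents = sum(n for d, n in table.items() if d + 1 in table)
        # Calls at depths d for which depth d - 1 was also observed.
        children = sum(n for d, n in table.items() if d - 1 in table)
        return children / parents if parents else 0.0
```

On a pipeline, where every depth receives about the same number of calls, this 
estimate is close to 1. 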

SCOOPP minimises the run-time impact of the parameter estimation overhead in 
three ways. First, granularity information is collected at class level, i.e., the v, p and φ 
parameters are measured for each class of parallel objects. This approach is clearly 
less costly than an instance-based approach and more accurate than a global one. 
Second, when the overhead introduced by accessing the system clock to measure the 
average method execution time is high (usually more than 1%), the frequency of 
information retrieval is reduced; this excludes, however, the application start-up phase, 
since in that phase no information is available. Third, the parameters that are 
estimated at run-time do not require inter-node communication, since the estimation 
is performed locally and parameter information is only exchanged within requests for 
remote object creation. 



2.3 Packing Policies 

Packing policies define “when” and “how much” to pack, i.e., the number of //obj that 
should be packed in each grain, and the number of method calls to pack in each 
message. These policies are usually grouped according to the structure of the 
application: object pipelines, static object trees (e.g. object farming) and dynamic 
object trees (e.g. work split and merge). The work presented here focuses on packing 
policies for pipelined algorithms and the next section evaluates their application to a 
case study, the Eratosthenes sieve. 

Packing Parallel Objects. The decision “when” to pack is taken based on the average 
method execution time (p), the average latency of a remote "null-method" call (a) and 
the overhead of the method parameters passing (v). When the average method 
execution time is excessively short, //obj should be packed, which occurs when the 
overhead of a remote method call is higher than the average method execution time, 
i.e., (a+v)>p. This is the turnover point to pack //obj, where the parallelism overhead 
becomes longer than the time spent on locally “useful work”. 

The decision of “how many” //obj to pack into a single grain (i.e., the degree of object 
packing or computation grain-size, Cp) depends on the a, v and p parameters as seen 
before, and also on the method fan-out (φ) and on the system computational load, i.e., 
the number of grains per node (y). The computation grain-size should be increased 
when the system presents high parallelism overhead (i.e., high a and v) and be 
decreased on high average method execution time. The degree of object packing 
should also be decreased when fan-out increases, since each method call performs 
several intra-grain calls, and it can be increased when the number of grains per node is 
high, to decrease parallelism overheads. 

On pipelined applications, packing adjacent //obj makes the number of intra-grain 
calls equal to the average number of objects in each grain, since the fan-out is close 
to 1. When Cp //obj are packed together, each remote method call generates Cp method 
calls, executing in Cp·p time. Under these conditions, the turnover point to decide 
when to pack is reached when Cp=(a+v)/p. This expression defines the minimum 
number of //obj to pack in each grain to overcome the parallelism overheads. To 
decrease parallelism overheads even more, SCOOPP increases the number of //obj in 
each grain linearly with y by using the expression Cp=y(a+v)/p. 
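
The object packing rule for pipelines can be sketched as below. The function names are 
hypothetical; rounding up to a whole number of objects and returning 1 when no 
packing is needed are assumptions of this sketch, not part of the expressions above. 

```python
import math


def should_pack_objects(a, v, p):
    """Pack when the remote call overhead exceeds the method execution time."""
    return (a + v) > p


def object_packing_degree(a, v, p, y=1):
    """Number of //obj per grain on a pipeline: Cp = y(a+v)/p."""
    if not should_pack_objects(a, v, p):
        return 1  # assumption: one object per grain when packing is not needed
    return math.ceil(y * (a + v) / p)
```

With illustrative values a=100, v=20 and p=40 (in the same time unit), the sketch 
packs 3 objects per grain for y=1 and 6 objects per grain for y=2. 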

Packing Method Calls. The decision “when” to pack method calls follows the same 
rule as the one applied to pack objects, i.e., when (a+v)>p; this condition reflects that 
the overhead to place a single remote call is higher than the remote method execution 
time. In this case, several inter-grain calls should be packed to reduce communication 
overheads. 

The decision “how many” method calls to pack into a single message (i.e., the degree 
of method call packing or communication grain-size, Cm) is computed from the a, v 
and p parameters. Sending a message that packs Cm method calls has a time overhead 
of (a+Cm·v) and the time to locally execute this pack is Cm·p. Packing should be 
performed such that (a+Cm·v)<Cm·p, i.e., when the overhead to place a remote call is 
lower than the time to locally execute the pack of method calls. Solving the 
inequality gives the turnover point Cm=a/(p-v). 
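
The turnover point follows from the condition in two steps, assuming p > v so that 
dividing by (p - v) preserves the direction of the inequality: 

```latex
\begin{align*}
a + C_m v &< C_m p \\
a &< C_m (p - v) \\
C_m &> \frac{a}{p - v} \qquad (p > v)
\end{align*}
```

For instance, with the illustrative values a=100, v=10 and p=30 (in the same time 
unit), packing pays off only for messages that carry more than 5 method calls. 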

When the average method execution time is close to or smaller than the overhead of 
the parameter passing (i.e., p<=v), method calls should not be packed. However, this 
rule can be relaxed if both method calls and //obj are packed. 

Packing Parallel Objects and Method Calls. SCOOPP can simultaneously pack 
method calls and //obj. However, when method calls are packed, the application 
performance may benefit from a lower //obj packing degree. In this case, SCOOPP 
scales down the computation grain-size by using the expression Cp=y(a+Cm·v)/(p·Cm). 

When the overhead of the parameters passing (v) is longer than the average method 
execution time (p), i.e., v>p, the method call packing factor should be decreased. In 
this case, the method call packing degree is estimated as Cm=a/v. 

To summarise, on pipelined applications, the p/v ratio is the key to choose between 
object and method call packing. When v<p, the communication packing degree, in 
number of method calls per message, is Cm=a/(p-v), and the object packing degree is 
decreased by the Cm factor, i.e., the number of //obj in each grain is 
Cp=y(a+Cm·v)/(p·Cm). When v>p, the communication packing degree is Cm=a/v and 
the same Cp expression can be used to compute the number of //obj per grain. 
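
The summary above can be condensed into one function, sketched below. The function 
name is hypothetical; rounding both degrees up to whole numbers, clamping them to a 
minimum of 1, and treating the unspecified case v=p like v>p are assumptions of this 
sketch. 

```python
import math


def packing_degrees(a, v, p, y=1):
    """Return (Cm, Cp) for a pipelined application.

    a: remote call latency, v: parameter passing overhead,
    p: average method execution time, y: grains per node.
    """
    if v < p:
        cm = a / (p - v)  # turnover point for method call packing
    else:
        cm = a / v        # v >= p (the case v == p is an assumption)
    cm = max(1, math.ceil(cm))
    # Object packing degree, scaled down by the method call packing degree.
    cp = max(1, math.ceil(y * (a + cm * v) / (p * cm)))
    return cm, cp
```

With illustrative values a=100, v=10, p=30 and y=2, the sketch yields 5 method calls 
per message and 2 objects per grain. 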



3 SCOOPP Evaluation with the Eratosthenes Sieve 

The Eratosthenes sieve is an algorithm to compute all prime numbers up to a given 
maximum. The original algorithm is well known; although several faster algorithms 
have been proposed [14][15][16], the original one is still the most adequate to 
illustrate relevant features in parallel algorithms, since a parallel version is intuitively 
obtained from the original sequential algorithm. 

One simple parallel implementation is a pipelined algorithm containing all 
computed prime numbers, where each element filters its multiples. Numbers are sent 
to the pipeline in increasing order. Each number that gets to the end of the pipeline 
is a prime number and is appended as a new filter. Fig. 1 presents the sieve processing 
flow for the numbers 3, 4 and 5. 
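
The pipeline logic can be sketched sequentially as follows. This is only an 
illustration of the filtering scheme: each list entry stands for one pipeline element, 
whereas in SCOOPP each filter would be a parallel object. The function name is 
hypothetical. 

```python
def pipeline_sieve(limit):
    """Return the primes up to limit using a chain of filters."""
    filters = []  # one entry per pipeline element, in creation order
    for number in range(2, limit + 1):
        for prime in filters:
            if number % prime == 0:
                break  # filtered out by this pipeline element
        else:
            # The number reached the end of the pipeline: it is prime
            # and is appended as a new filter.
            filters.append(number)
    return filters
```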






[Figure 1 legend: parallel task; message flow] 

Fig. 1. Parallel sieve of Eratosthenes processing numbers 3, 4 and 5 

The Eratosthenes sieve has been chosen to show the relevant features of 
SCOOPP since it has a totally predictable behaviour, making it adequate to evaluate 
the separate impact on the execution time of each parameter and packing approach. 
Furthermore, it is scalable to large environments, if a large maximum is selected. 

The Eratosthenes sieve has a large parallelism potential since each element of the 
pipeline (i.e., each sieve filter) can be a parallel task (i.e., a parallel object), which 
originates a large number of fine-grained parallel tasks. It dynamically creates parallel 
tasks and their number depends on the problem size. Table 1 presents the 
parallelism degree of the sieve of Eratosthenes for several problem sizes. 



Table 1. Parallelism degree of the Eratosthenes sieve for several problem sizes 

Problem size    Number of parallel tasks    Number of messages 
100             24                          290 
1 000           167                         14 292 
10 000          1 228                       762 862 
100 000         9 591                       46 224 072 



On a naive implementation of the sieve, each parallel task has a computation to 
communication ratio of one integer division operation per message received, which is 
too low a ratio for most distributed memory machines. A slightly optimised 
sieve was developed to increase this ratio and decrease the sieve sequential workload; 
it sends blocks of 10 values between sieve filters on a single method call. Each 
sieve filter marks the numbers that it filters and a block is merged with another block 
when it has more than 5 values marked. This optimisation decreases the number of 
messages by a factor close to 10 and increases the computation to communication 
ratio to a value close to 10 integer divisions per message received. 

The next subsection discusses how a programmer based static grain-size adaptation 
can increase this ratio. A second subsection shows performance results measured 
using the SCOOPP dynamic grain-size adaptation. Both subsections present 
performance results for an optimised sieve on a problem size of 100 000 values. 



3.1 Programmer Based Grain-Size Adaptation 

This section shows how a programmer can adapt the grain-size of the sieve to improve 
performance on several platforms. It presents the impact of the grain-size choices on 
the number of //tasks and inter-//task messages. Finally, it presents the execution times 
of the sieve for a number of grain-size choices and analyses the impact of grain-size 
choices on the three tested platforms. 

To adapt the grain-size in the sieve algorithm a parallel programmer may merge 
sieve filters into a single parallel object and/or pack several parallel objects into a 
single grain (i.e., a parallel task). Merging filters into a //obj requires some code 
rewriting, while packing //obj into a grain is less demanding: it needs minor code 
modifications, mainly to adapt the load distribution policy to perform a block 
distribution. Merging filters into a //obj removes overheads of intra-grain object 
creation and method calls, leading to lower execution times (i.e., a lower sequential 
workload). However, it requires complex code to support dynamic grain-size 
modifications. 

Both approaches adapt the computational grain-size, increasing the average number 
of operations per received value on each //task (i.e., the //task computation to 
communication ratio) and reducing the overall number of //tasks. On the sieve, this 
number of operations is directly proportional to the number of filters on each //task 
and is hereafter referred to as the //task computation granularity, in number of filters 
per parallel task. However, this increase may not lead to an acceptable performance, 
since there may not be enough //tasks and the sieve may still generate an excessive 
number of messages. Packing several method calls into a single message reduces the 
message traffic, decreasing the communication overhead. On the optimised sieve 
under study, the number of values per message is ten times the number of method calls 
per message, since each method call sends a block of 10 values, and is hereafter 
referred to as the inter-//task communication granularity. 

Table 2 presents the number of parallel tasks and inter-task messages required to 
compute the prime numbers up to 100 000, for several //task computation and 
communication granularities. The grain-size values were selected to show 
representative values of the sieve execution times. 

Fig. 2 presents the sieve execution times as a function of both the computation and 
communication granularities. The graphs present the execution times on 4 and 7 
cluster nodes, on 4 and 16 PowerXplorer nodes and on 14 and 56 MultiCluster nodes. 
In these experiments the measured values were obtained by using one sieve filter per 
//obj and grain packing was performed by packing several //obj into a single //task. 
The MultiCluster cannot run sieves with grains smaller than 3 sieve filters, due to 
memory space limitations. All graphs are scaled to the sieve execution time on a single 
node. 



Table 2. Sieve parallelism degree for several computation and communication granularities 

On all these target platforms the computation granularity has a strong impact on 
the sieve performance: when the computation grain-size is too fine or too large the 
performance penalties are considerably heavy. Too fine grains can lead to a large 
number of //tasks and the associated overhead costs; too large grains may not use all 
the available processing nodes. 

Communication grains also have an impact on the overall performance: on smaller 
systems, fine grains (short messages) introduce a penalty, since they generate an 
excessive number of messages between pairs of nodes; on large systems, shorter and 
more frequent messages favour load balancing and reduce start-up times. 

These results show how relevant the right choice of both the computation and 
communication grain-sizes is. However, they also show how time consuming a 
programmer based approach can be, due to the dynamic nature of the parallel tasks of 
the sieve; it requires long experimental work (to test a wide range of computation and 
communication grain-sizes) and/or a deep analysis of both the algorithm and target 
platform features. 

3.2 Dynamic Grain-Size Adaptation 

One of the main goals of SCOOPP is dynamic scalability through grain packing. To 
test the effectiveness of the SCOOPP granularity control, an evaluation of the ability 
to dynamically pack sieve filters was performed. This evaluation compares the lowest 
execution times - experimentally measured in the previous section - with the 
execution times obtained by running the sieve on several target platforms, without any 
code change and relying on the granularity control mechanisms in the SCOOPP 
run-time system. 

[Figure 2 panels: a) 4 x 350 MHz Pentium II (in Cluster); b) 7 x 350 MHz Pentium II 
(in Cluster); c) 4 x 66 MHz PPC 601 (in PowerXplorer); d) 16 x 66 MHz PPC 601 
(in PowerXplorer); e) 14 x 30 MHz T805 (in MultiCluster); f) 56 x 30 MHz T805 
(in MultiCluster)] 

Fig. 2. Sieve execution times for a programmer based grain-size adaptation 

Table 3 summarises the measurements mentioned above. With the SCOOPP strategy, the 
computation and communication grain-sizes were obtained by computing the number 
of //obj in each //task and the number of method calls in each message, according to 
the expressions in section 2.3. A circular load balancing strategy spreads grains 
through the nodes and can place several grains per node. 

The P column shows the number of processors used for each test. For both the 
programmer based and SCOOPP grain-size adaptation several parameters are given: T 
is the execution time in seconds; Sp is the speedup obtained relative to the same 
sieve on a single node; y is the number of grains placed on each node; Cp is the degree 
of the computation packing, in number of //obj (filters) per //task (i.e., the 
computation grain-size) and Cm is the degree of the communication packing, in values 
per message (i.e., the communication grain-size). Cm is ten times the number of method 
calls per message, since each method call sends a block of 10 values. In the SCOOPP 
methodology Cp and Cm are mean values, since they are computed dynamically and 
change during run-time. 

The SCOOPP methodology results also include 3 columns with the estimated 
parameters, in microseconds: the remote method call latency (a), the overhead of the 
method parameters passing (v) and the average method execution time (p). The latter 
two parameters are also mean values. 



Table 3. Comparing sieve execution times: programmer based and SCOOPP 

These results show the effectiveness of the SCOOPP methodology in scaling the 
sieve application on several target platforms. The methodology was able to 
dynamically increase grain-sizes to obtain speedups of the same order of magnitude as 
a programmer-based approach. Moreover, execution times obtained through the 
SCOOPP methodology are often within 20% of the optimal values, showing that 
this methodology successfully removes most of the parallelism overheads. The 
remaining overhead is usually due to the choice of a too large or too small number of 
grains. However, removing this overhead requires knowledge of the total number of 
//tasks or some guessing through experiments, such as the ones performed in the 
previous section. These alternatives increase development costs and are not feasible for 
applications where the number of //tasks is strongly dependent on input data. 

When computation and communication grain-sizes are controlled through packing 
(both object and method call packing) the total number of created objects and method 
calls remains the same. To reduce this sequential workload - due to the object oriented 
paradigm - the programmer can “pack” by merging several //obj into a single //obj 
(e.g., pack several filters into a single //obj) and by grouping blocks of values in a 
single method call. Fig. 3a and 3b show the impact of merging several filters into a 
single //obj and increasing the block size on method calls. The graphs show execution 
times on a single cluster node and the ideal execution time on 4 cluster nodes, for 
several computation and communication grain-sizes. 




[Figure 3 panels: a) "Packing" on 1 cluster node; b) Ideal "packing" on 4 cluster nodes. 
Horizontal axis: computation grain-size (filters per //obj), from 1 to 10000] 

Fig. 3. Execution times for partially optimised sieves through method call and object merging 

When these optimisation approaches are followed to supply SCOOPP with 
pre-optimised parallel versions, SCOOPP is also able to improve the overall 
performance. Fig. 4a presents the times obtained on sieves partially optimised by the 
programmer, with communication grain-sizes of 10 and 100 values per message; Fig. 4b 
shows their behaviour on the SCOOPP system. 

[Figure 4 panels: a) Programmer based on 4 cluster nodes; b) SCOOPP on 4 cluster nodes. 
Horizontal axis: computation grain-size (filters per //obj), from 1 to 10000] 

Fig. 4. Programmer based and SCOOPP implementation on partially optimised sieves 

These execution times show that the SCOOPP values are very close to the “ideal” 
ones in Fig. 3b. These results reinforce the effectiveness of the SCOOPP methodology, 
showing that SCOOPP can efficiently scale pipelined applications, even when these 
have previously been partially optimised by the programmer. 



4 Conclusion 

The commercial success of massively parallel systems was slowed down mainly by 
the lack of adequate tools to support automatic mapping of applications onto 
distinct target platforms without significant loss of efficiency. The burden on 
programmers was too high and the available tools were inefficient. SCOOPP attempts 
to overcome these limitations: it provides dynamic and efficient scalability of object 
oriented parallel applications across several target platforms, packing grains and 
messages, without any code modification. 

The presented results show the effectiveness of the SCOOPP methodology when 
applied to pipelined applications on several target platforms. The methodology is able 
to dynamically increase grain-sizes and to obtain speedups of the same order of 
magnitude as a programmer-based approach. Moreover, execution times obtained 
through the SCOOPP methodology are often within 20% of the optimal values, 
showing that this methodology successfully removes most of the parallelism 
overheads. 

Programmer based grain-size adaptation is not a competitive alternative to 
SCOOPP: it requires a wide range of tests on each target platform and each test is 
highly time consuming (as presented in Fig. 2). 

The performance penalties imposed by SCOOPP have a low impact on application 
execution time, and they are mainly due to the run-time requirements to estimate the 
application dependent parameters to adapt the computation and communication 
grain-sizes. A static adaptation can provide the correct grain-size at the beginning of 
the run, but a dynamic strategy requires some time to evaluate the application 
features and to react accordingly. 

Dynamic scalability of the parallel code version largely outweighs this small 
performance cost. It is the most promising approach to scale applications where task 
granularity is strongly dependent on input data. When compile time estimates of task 
granularity are not accurate, it may decrease the cost of the parallel code development 
and improve the code reuse on multiple target platforms. 

Current work includes the development of packing policies for static and dynamic 
object trees, and their application to less controlled application environments (such as 
computer vision applications). 



Abstract. A number of interesting properties for scheduling and/or cost 
estimation arise when using parallel programming models that restrict 
the topology of a program’s task graph to an SP (series-parallel) form. A 
critical question, however, is to what extent the ability to exploit parallelism 
is compromised when only SP coordination structures are allowed. 

This paper presents new application parameters which are key factors to 
predict this loss of parallelism at both language modeling and program 
execution levels, for shared-memory architectures. Our results indicate 
that a wide range of parallel computations can be expressed using a 
structured coordination model with a loss of parallelism that is small 
and predictable. 

1 Introduction 

In high-performance computing, the only programming methods typically used 
at present to deliver the huge potential of high-performance parallel 
machines are methods that rely on either the data-parallel (vector) 
programming model or simply the native message-passing model. Given current 
compiler technology, unfortunately, these programming models still expose the 
high sensitivity of machine performance to programming decisions made by 
the user. As a result, the user is still confronted with complex optimization 
issues such as computation vectorization, communication pipelining, and, most 
notably, code and data partitioning. Consequently, a program, once mapped 
to a particular target machine, is far from portable unless one accepts a high 
probability of dramatic performance loss. 

It is well known that the use of structured parallel programming models whose 
associated DAG has series-parallel (SP) structure has a number of 
advantages [16], in particular with respect to cost estimation [14, 4, 15], 
scheduling [3, 1], and last but not least, ease of programming itself. Examples of SP 
programming are found throughout the vector processing domain, as well 
as in the parallel programming domain, such as in the Bird-Meertens 
Formalism [15], SCL [2], BSP [18], NestStep [10], SPC [4] and OpenMP [12]¹. 

Despite the obvious advantages of SP programming models, however, a 
critical question is to what extent the ability to express parallelism is sacrificed 
by restricting parallelism to SP form. Note that expressing a typically non-SP 
(NSP) DAG corresponding to the original parallel computation in terms of an SP 
form essentially involves adding synchronization arcs in order to obey all existing 
precedence relations, thus possibly increasing the critical path of the DAG. For 
instance, consider a macro-pipeline computation of which the associated NSP 
DAG is shown at the left of Fig. 1, representing a programming solution according 
to an explicit synchronization or message-passing programming model. The 
figure on the right represents an SP programming solution, in which a full barrier 
is added between every computational step. While for a normally balanced 
workload situation the SP solution has an execution time similar to the NSP 
one, in the pathological case where only the black nodes have a delay value of 
T while the white ones have 0, the critical path of the SP solution is greatly 
increased. Due to the high improbability of such workload situations in a normal 
computation, SP solutions are generally accepted in vector/parallel processing. 
They are an easy and understandable way to program, useful for data-parallel 
synchronization structures as well as many other task-parallel applications, and 
provide portability to any shared or distributed-memory system. 





[Figure 1 label: (Inner part)] 

Fig. 1. NSP and SP solutions for a macro-pipeline 



Let γ ≥ 1 denote the ratio of the critical path of the DAG of the closest SP program 
approximation to that of the DAG associated with the original (NSP) algorithm, 
without exploiting any knowledge of individual task workloads 
(i.e., only topology is known). Recently, empirical and theoretical results 
have been presented that show that γ is typically limited to a factor of 2 in 
practice [4, 5, 7]. In addition, empirical evidence [7] has been presented that for 
a wide range of parallel applications, especially those within the data parallel 
(vector) model, γ is strongly determined by simple characteristics of the problem 
DAG pertaining to topology and workload. 

¹ Although OpenMP directives are oriented to SP programming, OpenMP provides a 
library for variable blocking that can be used to produce NSP synchronizations. 

Let Γ denote the ratio between the actual execution times of both solutions 
when implemented and optimized on a real machine. While γ ≥ 1 at program 
level, the actual performance loss (Γ) as measured at machine level will be 
positively influenced by the SP programming model as mentioned earlier (see 
Fig. 2). Thus, the initial performance loss when choosing an SP-structured 
parallel programming model may well be compensated or even outweighed by 
the potential gains in portability and in performance through superior scheduling 
quality (in terms of both cost and performance). 



[Figure 2 labels: Problem definition; Problems Space; Algorithm Level; Program Level; 
Machine Level] 

Fig. 2. Measures of performance loss at different levels of abstraction. 



In this paper we present the results of a study into the properties of γ and, 
in particular, Γ. More specifically, 

- we extend our earlier study on γ, which was primarily based on synthetic 
DAGs, with new results for real parallel applications, which confirm the 
applicability of simple algorithm metrics (such as application scalability, 
expected number of iterations, synchronization complexity) as prediction 
parameters for γ. 

- we present the results of a study on the relationship between Γ and γ 
based on the implementation of representative applications on two different 
shared-memory architectures: CC-NUMA (Origin2000) and vector machine 
(Cray J90), using different parallel language models (OpenMP or native 
parallel compiler directives and message passing, in SP and NSP versions). 

This paper is organized as follows. In Section 2 we present a model to measure 
the loss of parallelism at program abstraction level (γ). Section 3 contains 
an overview of the program parameters that determine γ, with some supporting 
theoretically derived formulae. Section 4 introduces the experimental design used to 
measure the loss of parallelism in algorithms and applications implemented on 
real machines. The results obtained for representative algorithms when measuring 
Γ are explained in Section 5. 



2 Program-Level Model 

The decision to use an NSP or an SP language is taken at the program design 
abstraction level. At this point, the programmer is not concerned about the cost 
of the parallelization mechanics (communications, creation and destruction of 
tasks, mutual exclusion). The programmer uses a parallel language or library to 
express the designed algorithm, taking advantage of the semantics to exploit the 
inherent parallelism of the problem. Although SP-restricted languages are easier to 
understand and program, being constrained to SP structures means that some of the 
parallelism could be lost due to added synchronizations. At this level we investigate to 
what extent one can expect high losses to appear, and what parameters related 
to the algorithm and workload distribution are responsible for this loss. 

For our model of programs we use AoN DAGs (Activity on Nodes), denoted 
by G = (V, E), to represent the set of tasks (V) and dependencies (E) associated 
with a program when run with a specific input-data size. Each node represents 
a task and has an associated load or delay value representing its execution time. 
At this level, edges represent only dependencies and have no delay value. If 
required, communication delays should be included in terms of their own specific 
tasks. SP DAGs are a subset of task graphs which have only series and parallel 
structures, which are constructed by recursively applying Fork/Join and/or series 
compositions [5]. 

Let W : V → ℝ denote the workload distribution of the tasks. Given G and 
W, we define C(G) (Critical path or Cost) to be the maximum accumulated 
delay over all full paths of the graph. 
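
A minimal sketch of how C(G) can be computed for an AoN DAG is given below. The 
function name and the dictionary-based graph representation are assumptions made for 
illustration; `graphlib` requires Python 3.9 or later. 

```python
from graphlib import TopologicalSorter


def critical_path(preds, load):
    """Return C(G): the maximum accumulated delay over all full paths.

    preds maps a task to its predecessor tasks (the edges of G);
    load maps every task to its delay value (the workload W).
    """
    graph = {t: tuple(preds.get(t, ())) for t in load}
    finish = {}
    # Visit tasks so that predecessors are always processed first.
    for t in TopologicalSorter(graph).static_order():
        start = max((finish[p] for p in graph[t]), default=0.0)
        finish[t] = start + load[t]
    return max(finish.values(), default=0.0)
```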

A technique to transform an NSP graph into SP structure without violating the 
original precedence relations is called an SP-ization technique. It is a graph 
transformation (T : G → G') where G is an NSP graph, G' has SP form, and 
all the dependencies expressed in G are directly or transitively expressed in G'. 
Due to the new dependencies introduced by the SP-ization process, the critical 
path may be increased. Let γ_T denote the relative increment in the critical path 
generated by a given T, and γ that generated by the best possible T, given by 

$$\gamma_T = \frac{C(G')}{C(G)}, \qquad \gamma = \min_T(\gamma_T),$$

respectively. Clearly, γ is a function of DAG topology and workload W. For 
a given W, there exists a transformation T that minimizes γ_T. In a typical 
programming scenario, however, the exact W is either not known or highly 
data-dependent. Thus, it cannot be exploited in determining the optimal SP program. 
As a consequence, γ_T and the mean value $\bar{\gamma}_T = E(\gamma_T)$ for any 
possible W are the most interesting magnitudes to study. Although there does not 
exist a generic optimal SP-ization for any G and W, a conjecture was introduced 
previously that 

$$\forall G,\ \exists T : \gamma_T \le 2,$$

except for extremely unlikely (pathological) values of W [4]. 

Having no real W information, a fair assumption is to use independent,
identically distributed (i.i.d.) task loads. This is especially suitable for huge regular
problems and topologies, and accurate enough for fine or medium grain
parallelism in general.

Let G = (V, E) be a DAG, and t ∈ V a task. A number of important DAG
properties are listed below:

Task in-degree            i(t) = |{(t', t) ∈ E}|
Task out-degree           o(t) = |{(t, t') ∈ E}|
Task depth level          d(t) = 1 + max(d(t')) : (t', t) ∈ E
Graph layers              L(G) = {l : l ⊆ V; ∀ t, t' ∈ l, d(t) = d(t')}
Graph depth               D(G) = max(d(t))
Graph parallelism         P(G) = max_{l ∈ L(G)} (|l|)
Synchronization density   S(G) = (1/|V|) Σ_{t ∈ V} o(t)
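The properties listed above can be computed directly from an edge list. The sketch below is only illustrative: the function name `dag_properties` is hypothetical, and the synchronization density is taken here as the mean out-degree (edges per task), which is an assumption chosen to be consistent with the values S = 3 for 1D cellular automata and S = 2 for pipelines quoted later in the text.

```python
from collections import defaultdict
from graphlib import TopologicalSorter


def dag_properties(nodes, edges):
    """Compute the degree, depth, D, P and S properties of a DAG."""
    preds = {t: set() for t in nodes}
    succs = {t: set() for t in nodes}
    for u, v in edges:
        preds[v].add(u)
        succs[u].add(v)

    # d(t) = 1 + max depth of the predecessors (roots get depth 1).
    depth = {}
    for t in TopologicalSorter(preds).static_order():
        depth[t] = 1 + max((depth[p] for p in preds[t]), default=0)

    # A layer groups all tasks that share the same depth level.
    layers = defaultdict(set)
    for t, d in depth.items():
        layers[d].add(t)

    return {
        "in_degree": {t: len(preds[t]) for t in nodes},
        "out_degree": {t: len(succs[t]) for t in nodes},
        "depth": depth,
        "D": max(depth.values()),
        "P": max(len(layer) for layer in layers.values()),
        # Assumption: S is the mean out-degree, i.e. |E| / |V|.
        "S": sum(len(succs[t]) for t in nodes) / len(nodes),
    }


# Hypothetical diamond graph: a -> {b, c} -> d
props = dag_properties(
    ["a", "b", "c", "d"],
    [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")],
)
print(props["D"], props["P"], props["S"])
```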



3 Influence of Task Graph Parameters 

Previous empirical studies [5] with random i.i.d. W identified specific structured
topologies that present worse γ values than random unstructured topologies with
the same number of nodes. The inner part of pipelines (see Fig. 1) and cellular
automata showed this behaviour and are therefore candidates for our research on the
topological factors that are responsible for the loss of parallelism caused by
SP-ization. Other interesting algorithms are also included: LU reductions, Cholesky
factorizations, FFT, and several synthetic topologies.

Graphs generated from these algorithms, for different input-data sizes, have
been constructed at the program level. An estimate of γ has been derived
for them using random task loads and simple SP-ization techniques [6],
in order to study the effect of several graph parameters that appeared to be
relevant.
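The kind of estimate described above can be sketched with a small Monte Carlo experiment. The code below is an assumption-laden illustration, not the procedure of [6]: it models a 1D cellular automaton in which each task depends on its own cell and the two neighbouring cells of the previous iteration, draws i.i.d. Gaussian task loads, and uses a barrier after every layer as the SP-ization. The function name `gamma_estimate` and all parameter values are hypothetical.

```python
import random


def gamma_estimate(P, D, mu=1.0, sigma=0.2, trials=20, seed=0):
    """Average C_SP / C_NSP over random i.i.d. Gaussian task loads."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # loads[d][p]: load of cell p at iteration d, clipped at zero.
        loads = [[max(0.0, rng.gauss(mu, sigma)) for _ in range(P)]
                 for _ in range(D)]

        # NSP critical path: task (d, p) waits for cells p-1, p, p+1
        # of the previous iteration only.
        finish = loads[0][:]
        for d in range(1, D):
            finish = [
                loads[d][p] + max(finish[max(p - 1, 0):min(p + 2, P)])
                for p in range(P)
            ]
        c_nsp = max(finish)

        # SP version (assumed): a barrier after each layer, so the cost
        # is the sum of the per-layer maxima.
        c_sp = sum(max(layer) for layer in loads)
        total += c_sp / c_nsp
    return total / trials


print(gamma_estimate(P=16, D=16))
```

Because every path of the original graph visits one task per layer, the barrier version can never be cheaper, so the estimate is always at least 1; with zero deviation it is exactly 1.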

Scalability (or number of processors) and number of iterations are represented
by the graph size parameters: maximum degree of parallelism (P) and
depth level (D). The effect of P and D on γ, measured in several representative
applications, is shown to be sub-logarithmic and bounded by simple
functions [7]. Fig. 3 shows the γ curves generated for a 1D cellular automaton
when only one parameter is varied. The contribution of D to the loss of
parallelism is limited when D > P. This is true for all but one of the considered
topologies (the macro-pipeline) [7].

Fig. 3. Effect of P and D - 2D Cellular Automata

Furthermore, the inherent synchronization activity of an algorithm, represented
by the Synchronization Density parameter (S), bounds the γ growth
even for small values. Fig. 4(a) shows how increasing the number
of edges in a synthetic regular topology quickly diminishes the possible loss of
parallelism. On the other hand, small values of S (S < 2) indicate the presence
of series of tasks. SP-ization techniques that take advantage of these structures
lead to great decreases of γ. Fig. 4(b) presents this effect on a fine-grain
parallelization of a Cholesky factorization. The γ values reflect the behaviour of the small
and variable S parameter, which in this case depends on the input-data size,
growing for small P values and decreasing afterwards.



[Figure: (a) Random Edges Grid (D=P=100), x-axis: S (Synchronization Density); (b) Cholesky Factorization (S<2), x-axis: Matrix dimension (P)]

Fig. 4. Effect of P and D - 2D Cellular Automata



Some theoretically derived formulae, using coarse SP approximations of the
original DAGs and order statistics, support all these results [17]. As an example
we provide the semi-analytical formula for γ in the special problem of the
macro-pipeline.

Let P = D and let W be modeled by an i.i.d. delay per node according to a
Gaussian(μ, σ) distribution. The critical path (C_SP) of SP graphs can be derived
using a well-known approximation of the cost of a P-node parallel section from
order statistics [9], given by

$$C_P = \mu + \sigma\sqrt{2\log(0.4P)}$$

It follows

$$C_{SP} = D\left(\mu + \sigma\sqrt{2\log(0.4P)}\right)$$

Since generally the cost estimation of NSP stochastic DAGs is analytically
intractable, we approximate the NSP DAGs by SP DAGs that capture the main
inner features of the original. It appears that the parallelism P' of the SP DAG
approximation is directly related to S and P of the original NSP DAG, according
to P' = S + log(P/2) for cellular automata and pipeline topologies. (For 1D
cellular automata S = 3, for pipeline S = 2 [6]). C_NSP is then approximated
within 10% error [6].

Subsequently applying the SP DAG critical path approximation from order
statistics yields

$$C_{NSP} = D\left(\mu + \sigma\sqrt{2\log(0.4(S + \log(P/2)))}\right)$$

Consequently

$$\gamma \approx \frac{C_{SP}}{C_{NSP}} \approx \frac{D\left(\mu + \sigma\sqrt{2\log(0.4P)}\right)}{D\left(\mu + \sigma\sqrt{2\log(0.4(S + \log(P/2)))}\right)}$$

This formula agrees with our experiments within 25% [17]. A coarse, but
meaningful simplification of the formula for (typically) large P is given by

$$\gamma \approx \frac{\mu + \sigma\sqrt{\log(P)}}{\mu + \sigma\sqrt{\log(S)}}$$

Indeed, the asymptotic influence of P is clearly logarithmic, while the effect
of S is exactly inverse, which is in agreement with the results. Also the effect of
the workload distribution is in agreement with our measurements (considering
the typical case where P ≫ S). The important parameter is the relation between
the deviation and mean of the task loads (Dev = σ/μ).
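To make the semi-analytical formula concrete, the sketch below evaluates the ratio C_SP / C_NSP for given P, S, μ and σ. It assumes natural logarithms, and it adds a guard (not part of the paper) that drops the correction term when the argument of the logarithm is not greater than 1, which happens only for very small P. The function names are hypothetical.

```python
import math


def parallel_section_cost(n, mu, sigma):
    """Approximate cost of an n-task parallel section of Gaussian loads."""
    arg = 0.4 * n
    if arg <= 1.0:
        # Guard (assumption): the approximation is not defined here,
        # so only the mean load is returned.
        return mu
    return mu + sigma * math.sqrt(2.0 * math.log(arg))


def gamma_semi_analytical(P, S, mu, sigma):
    """Ratio C_SP / C_NSP; the common factor D cancels out."""
    c_sp = parallel_section_cost(P, mu, sigma)
    # P' = S + log(P/2) is the parallelism of the SP approximation
    # of the NSP graph.
    c_nsp = parallel_section_cost(S + math.log(P / 2.0), mu, sigma)
    return c_sp / c_nsp


# Hypothetical parameters: pipeline (S = 2) and 1D cellular automaton (S = 3).
for S in (2, 3):
    print(S, gamma_semi_analytical(P=100, S=S, mu=1.0, sigma=0.2))
```

With these illustrative values the ratio is smaller for S = 3 than for S = 2, matching the observation that a higher synchronization density limits the loss of parallelism.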

4 Program Execution Level 

At the implementation level a parallel program is compiled and optimized for a
specific machine. When executed, it uses costly mechanisms to spawn, synchronize
and communicate between tasks. At this level the underlying architecture of the machine
becomes important.

The exact cost added to the real execution time is highly dependent on the im- 
plementation, and for a particular machine, on the parallel mechanisms provided 
at the language level. The advantages of the architecture must be exploited. 

An SP version of a program typically needs to add dependencies, and new
delays inherent to the more complex synchronization and communication scheme
may increase execution times. The cost of any parallel mechanism is different
for each architecture. Nevertheless, the better data partitioning and scheduling
techniques, only possible when SP-restricted programming is used, can minimize
the communication needs and compensate for this effect.

In this study we focus on shared-memory architectures. The programming
techniques used in these machines are straightforward, and the programmer
does not normally face the data distribution or scheduling details directly.

Our study is focused on two basic shared-memory architectures: CC-NUMA
(Origin2000) and vector machine (CrayJ90). Both have representative properties
for performance evaluation of synchronization techniques. The CrayJ90 has a
non-hierarchical memory structure. Thus, synchronizations and data accesses have
more predictable delay times. Automatic optimizations deployed by the compiler
are mainly oriented to the efficient use of the vector processing units and
coarse-grain parallelization. The Origin2000 is a CC-NUMA machine. The use of a memory
hierarchy improves performance, while cache-coherence protocols and automatic
process migration try to hide machine level details from the programmer.
Nevertheless, the efficient use of memory locality is not an easy task even with
compiler assistance. Delay times for data access and synchronizations are less
stable, especially when full communications or barriers are used across the whole
system.

Experiments: For the experiments presented in this paper we have chosen 
three of the most representative and easy to program algorithms studied at the 
previous level: 

— 2D Cellular automata (1750x1750 grid, 1750 iterations) 

— Pipeline (Inner part, see Fig. 1) (30000 cell vector, 30000 iterations) 

— LU reduction (1750x1750 matrix)

The implementation of the algorithms has been straightforward, using
documentation examples and text books. The following programming models have been used:

— MPI, NSP version (Cray and Origin2000) 

— MPI with added Barriers, SP version (Cray and Origin2000) 

— OpenMP directives, SP version (Origin2000) 

— OpenMP variable blocking, NSP version (Origin2000) 

For performance comparisons, our reference model is an MPI implementation
programmed with point-to-point communications. The second model is generated
by adding barriers to the first MPI code, to transform it to SP form. This lets us
compare the real effect of a direct transformation from NSP code to SP, using the
same parallel tool implementation. In most examples two versions with OpenMP
are presented. The first one uses simple OpenMP directives, mainly parallel
loops, sections and barriers to produce an SP program. In most cases the use
of barriers has been intentionally included for comparison with the MPI barrier
version. The second one, not developed in every example, is a complicated NSP
code where every synchronization is implemented through blocking variables,
used as semaphores in the most efficient way.

Let T_MPI be the execution time of the NSP version with MPI point-to-point
communication. Let T_model-i be the execution time of any of the other versions.
Then

$$\Gamma_{model-i} = \frac{T_{model-i}}{T_{MPI}}$$



We avoid aggressive compiler optimization except in the last version. Compiler
code manipulation (mainly loop reordering and unrolling) could change the
synchronization patterns. Cray versions are compiled with vector optimizations
enabled. Specific SP data partitioning and scheduling techniques have not been
exploited; we rely on the machine native system.

Although at the program abstraction level we assumed i.i.d. workloads,
we found that the only important parameter of the workload distribution
is the deviation-to-mean ratio of the task loads (Dev = σ/μ). Our simple kernel
programs split the load of the fixed-size input data between the processors,
leading to different means when different numbers of processors are used. However,
the changes in the deviation-to-mean ratio (Dev) are small enough to compare the
results, introducing only minimal errors. New experiments with scaled-up problem
sizes that keep the workload mean nearly constant have been carried out, giving
similar results [13].

Experimental measures include the total execution time of the parallel section
of each code, as well as the mean and deviation of task times. We consider a
task to be a continuous serial computation, from the point after a wait for
synchronization is issued (one or more communication receptions, blockings or
barriers) to the next one. Experiments with 2, 4, 6 and 8 processors were made.
Results, codes and tools are available in [13].



5 Machine Level Results 

In this section we present the results of the machine level experiments.
Comparisons with the γ predictions obtained for these problems with the program level
model are discussed.



Highly Regular Computations: As expected, the task loads of Cellular Automata
and Pipeline problems present a minimum deviation in both machines.

Cellular Automata: σ/μ ∈ (0.01, 0.06)

Pipeline: σ/μ ∈ (10^-?, 10^-?)

Fig. 5, Fig. 6 and Fig. 7 make it clear that the SP versions (especially
the MPI-SP one) incur a negligible increase of the execution time compared
to the MPI reference model. The small slope of the Γ curves complies with the γ
predictions of our program level model, although the values are slightly higher
due to extra communication costs.

Fig. 5. CrayJ90 - 2D Cellular Automata (1750x1750 grid, 1750 iterations; x-axis: CPUs)

Fig. 6. Origin2000 - 2D Cellular Automata (1750x1750 grid, 1750 iterations; x-axis: CPUs)



At every iteration, each barrier generates a fixed increment of the execution
time, apart from the loss of parallelism. It is remarkable that the Origin2000 MPI
barrier implementation is almost perfectly efficient, while the OpenMP system
adds significant cost to the execution time [8]. More efficient nested fork/join
techniques in the Origin2000 are being researched [11].

As shown in Fig. 6, the OpenMP-SP program generated for the cellular
automata problem is rather inefficient due to poor serial code optimization.
Although another improved version has been implemented that shows much better
results, it is interesting to notice the similarities between the high Γ curve and
the predictions from the program level model. The OpenMP-NSP version,
aggressively optimized, produces smaller task delays and therefore lower execution
time. Nevertheless, its Γ results are comparable with the MPI reference model as
the synchronization pattern has not been changed. This reveals that even if Γ
can be highly affected by serial programming practices and optimization, the
effect of SP-synchronization is preserved.

Fig. 7. Origin2000 - Pipeline (inner part) (panel title: MPSynch Pipeline (30000), 30000 iterations; x-axis: CPUs)

LU Decomposition: LU reduction programs typically distribute the matrix
rows to the processors, synchronizing after they have been updated. This scheme
does not exploit the fine-grain parallelism inherent to each element update, since the
amount of work for each task would be too small, and communication costs would
outweigh the computation.

Medium-coarse grain parallelism produces higher deviations, as the tasks do
not run computations of similar length. In this case, because only the triangular
part of the matrix is processed, the number of elements updated in each row is
different in each iteration. This effect is slightly reduced in the MPI
implementation through row interleaving. Even so, deviations are still high:

LU reduction: σ/μ ∈ (1.5, 1.7)

The Γ curve obtained on the CrayJ90 is compared in Fig. 8 with γ predictions
for highly unbalanced situations. The high Γ values are still not fully explained,
although this increase may well be due to the variable cost of a barrier when
the number of processors grows (see Table 1).



NPROC    CrayJ90    Origin2000
2        .002       .00005
4        .007       .00049
6        .011       .00053
8        .016       .00083

Table 1. Barrier cost estimation (sec.)



Adding the predicted effect of γ to the cost of 1750 barriers (one for each
iteration) gives a closer approximation to the real execution times. This effect
is less noticeable in programs with higher task loads like cellular automata (see
Fig. 5) and in the Origin2000, where the barrier cost, compared with the total
execution time, is much lower (see Fig. 9). For the Origin2000 it is only detectable for
applications with very low task loads, such as pipelines (see Fig. 7).

Fig. 8. CrayJ90 - LU reduction (1750x1750)

Fig. 9. Origin2000 - LU reduction (1750x1750; x-axis: CPUs)
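The combination of the γ prediction with the accumulated barrier cost can be worked through numerically. The sketch below is one possible reading of that estimate, namely T_SP ≈ γ · T_NSP + (number of barriers) · (barrier cost); the exact form used by the authors is not given in the text. The barrier costs are those of Table 1, while the function name and the sample execution time and γ value are hypothetical.

```python
# Barrier cost estimates in seconds, copied from Table 1.
BARRIER_COST = {
    "CrayJ90": {2: 0.002, 4: 0.007, 6: 0.011, 8: 0.016},
    "Origin2000": {2: 0.00005, 4: 0.00049, 6: 0.00053, 8: 0.00083},
}


def estimate_sp_time(t_nsp, gamma, machine, nproc, barriers=1750):
    """Assumed model: gamma-scaled NSP time plus accumulated barrier cost."""
    return gamma * t_nsp + barriers * BARRIER_COST[machine][nproc]


# Hypothetical NSP execution time of 100 s and predicted gamma of 1.1.
for machine in BARRIER_COST:
    t_sp = estimate_sp_time(100.0, 1.1, machine, nproc=8)
    print(machine, round(t_sp, 2), round(t_sp / 100.0, 3))
```

With 8 processors the 1750 barriers add about 28 s on the CrayJ90 but under 1.5 s on the Origin2000, which illustrates why the barrier term matters far more on the former.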



Efficient SP Programming: For the LU reduction algorithm the OpenMP-NSP
version is quite complicated and has not been implemented. Instead, in
Fig. 9 we show the results obtained with a manually parallelized OpenMP
version, based on SP parallel loops. Automatic optimizations have been applied. An
efficient SP version produces good results in Γ terms, with no relative increase
when compared with the MPI reference model.



Iterations: Algorithms have been run for a range of small numbers of iterations.
The results agree with the program level predictions. Fig. 10 shows how,
in a cellular automaton, for a fixed number of processors, the execution time
grows with a given ratio for each programming model. Consequently, Γ is not
dependent on the number of iterations D (provided D > P, as predicted at the
program level).



6 Conclusion 

In this paper we present a study on the relationship between the loss of
parallelism inherent to SP-restricted programming and the real performance loss
as measured at the machine execution level. The results indicate that the difficult
task of mapping a code to a real parallel machine is indeed much easier with a
restricted SP language model, while the actual performance loss (Γ) due to the
lack of expressiveness is small and predictably related to the properties of
the algorithm (γ). Bad serial programming practices, parallelization, data
distribution, or the implementation of communication-synchronization libraries and
tools might have higher impacts on performance than γ.

Fig. 10. Origin2000 - 2D Cellular Automata - Iterations (1750x1750 grid, 8 CPUs)

Optimized SP techniques for scheduling or data partitioning have not been 
exploited, which would have given better results than presented in this paper. 

To the best of our knowledge such a comparative study between NSP and 
approximated SP implementations of real applications has not been previously 
performed. 

The significance of the above results is that the optimizability and portability
benefits of efficient cost estimation in the design and/or compilation path can
indeed outweigh the initial performance sacrifice when choosing a structured
programming model.

Future work includes a further study into more irregular, data-dependent or
dynamic algorithms and applications, to determine DAG properties to accurately
measure both γ and Γ, as well as exploration of the benefits of specific SP
programming environments for both shared-memory and distributed-memory
architectures.

Abstract. A neural network based tool has been developed to assist in the
process of code transformation. The tool offers advice on appropriate 
transformations within a knowledge-driven, semi-automatic parallelisation 
environment. We have identified the essential characteristics of codes relevant 
to loop transformations. A Kohonen network is used to discover structure in the 
characterised codes thus revealing new knowledge that may be brought to bear 
on the mapping between codes and transformations or transformation 
sequences. A transform selector based on this process has been developed and 
successfully applied to the parallelisation of sequential codes. 



1 Introduction 

Over the past decade there has been a dramatic increase in the range of different 
multiprocessor systems available in the marketplace ranging from high-cost 
supercomputers, such as the Cray T3D, to low-cost workstation clusters. It is fair to 
say that the low-cost architectures have proven attractive to the majority of potential 
scientific and engineering users, particularly for applications that need to exploit the 
raw power now available. However, their appeal is often tempered by deficiencies in 
the attendant program development environments. Indeed, these deficiencies are 
evident in all multiprocessor development environments. It may be pejorative, but 
nonetheless accurate, to characterise the majority of typical users of multiprocessor 
systems as unskilled in the arts of parallelisation. Typically, they will be confirmed in 
the use of essentially sequential languages such as Fortran and have neither the time 
nor the desire to understand the intricacies of parallel development techniques or the 
peculiarities of target architectures in their endless search for enhanced performance. 
For such users there is a compelling need for the development of environments which 
minimise user involvement with such complications. In short, if novice users are to 
realise the potential of multiprocessor systems then much of the expert knowledge 
required to develop parallel code on multiprocessor architectures must be provided 
within the development environment. 

J.M.L.M. Palma et al. (Eds.): VECPAR 2000, LNCS 1981, pp. 142-153, 2001.

The ideal approach is to provide a fully automated solution in which a user can
simply input a sequential program to the system, be it migrated or newly developed,
and receive as output an efficient parallel equivalent. Automatic parallelisation scores
high on expression, as programmers are able to use conventional languages, but is
problematic in that it requires inherently complex issues such as data dependence
analysis, parallel program design, data distribution and load balancing to be
addressed. Existing parallelisation systems adopt a range of techniques in an effort to
minimise or eliminate the complexity inherent in the fully automated approach.
Almost invariably the burden of providing the necessary guidance and expertise 
lacking in the system falls back on the user. Indeed, existing systems may be 
classified by the extent to which user interaction is required in the process of code 
parallelisation. At one end of the spectrum is the purely language based approach in 
which the user is entirely responsible for determining how parallelism is to be 
achieved by annotating the code with appropriate compiler directives. At the other 
end of the spectrum is the goal of a fully automatic parallelisation environment, 
independent of application domain and requiring no user guidance. Between these 
extremes lie a number of environments which permit the user to interact with the 
system during the code development process. These environments may offer differing 
degrees of guidance and interaction in, for example, selecting appropriate program 
transformations or deciding on a particular data partitioning scheme. 

Recently attention has focused on the use of knowledge bases and expert systems 
as a means of compensating for lack of user expertise. Such approaches have 
achieved a degree of success but to date have taken a rather narrow view of the field 
of knowledge engineering. Alternative methods of extracting and representing 
knowledge directly from the code itself have not been widely applied. Genetic 
programming techniques applied to program restructuring have been explored by 
Ryan et al. [1] with notable success while neural networks have been variously
applied to load balancing and data distribution [2, 3]. It is our belief that if a system is 
to be developed capable of offering quality strategic guidance for parallelisation then 
it must be underpinned with an appropriate knowledge model in which expert 
knowledge and information implicit in the code itself may be captured and 
synthesised within a coherent framework [4]. 

The research reported here complements and extends previous work in developing 
an integrated software development and migration environment for sequential 
programmers. The result of this work, KATT, (Knowledge Assisted Transformation 
Tools) has been reported elsewhere [5, 6, 7]. KATT began by employing expert 
systems only. As various neural network based tools are developed they have been 
integrated into the environment in line with the underlying knowledge model [8]. This 
paper reports the development of one of these neural-based components, a transform 
selector, and reports results obtained from its use with real codes. 



2 The KATT Environment 

Architecturally, KATT may be considered to consist of three main modules as shown
in figure 1. These are:

• The input handler - responsible for converting input codes to an intermediate,
language independent, graphical representation.

• The transformation module - which has access to a suite of correctness preserving
graph manipulation routines used to restructure the graphical representation of the
code produced by the input handler. The result is a new, functionally equivalent
version of the program that is more amenable to parallelisation, i.e., with
dependencies removed or reduced.

• The output handler - responsible for generating actual parallel code, based on the
modified graph and for evaluating its performance on the target architecture.




[Figure: processing pipeline. Input Handler Tools: source code, lexical and syntactic analysis and graph production, graph representation of user program. Transformation Tools: graph transformations (standard and optional), reordered graph. Code Generation and Evaluation Tools: code profiling and generation, parallel program.]

Fig. 1. An architectural schematic of the KATT system showing the neural network based
transform selector.



KATT employs a source-to-source restructuring model. The explicit knowledge 
available within the environment is provided by two expert systems; one to aid the 
user in selecting appropriate code transformations, the other to advise on the best 
distribution of code and data on the target architecture. 

Neural networks however offer a complementary means of accessing an alternative 
source of knowledge relevant to parallelisation. By extracting domain knowledge 
implicit in the code itself neural networks can provide an alternative low-level, signal- 
based, view of the parallelisation problem in contrast to the high-level, symbolic view 
offered by expert systems. Combining both paradigms, as illustrated in figure 1, 
within a coherent knowledge model will improve the ability of KATT to offer 
strategic intelligent guidance to the user through access to a broader and deeper 
knowledge model. 



3 Development of the Transform Selector 

We use an SPMD model that concentrates on detecting and realising potential
parallelism in computationally intensive sequential code loops. To do this requires
that any loop-carried dependencies, the principal inhibitors to parallelisation, can be
detected and removed by a suitable transformation. What is required is a tool capable
of taking an input code loop and recommending an appropriate transformation, or
transformation sequence, that will reduce or eliminate any dependencies in the input
code.



3.1 Dependencies and Their Removal 

Before considering the design of the neural network based transform selector it may 
be instructive to review the importance of loop-carried data dependencies as inhibitors 
to the concurrent execution of code and how transformations may be used to remove 
them. 

To illustrate data dependence, let S1 and S2 be statements and M a memory
location. The four types of data dependence scenario that may arise are shown in
Table 1.

Table 1. Types of Data Dependence

Data dependence   Action                          Notation
Flow or true      S1 writes M then S2 reads M     S1 δ^f S2
Anti              S1 reads M then S2 writes M     S1 δ^a S2
Output            S1 writes M then S2 writes M    S1 δ^o S2
Input             S1 reads M then S2 reads M      S1 δ^i S2
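The four cases in the table amount to a lookup on the ordered pair of access kinds. As a minimal illustration (the function name `dependence_type` is hypothetical and not part of KATT):

```python
def dependence_type(first, second):
    """Classify the dependence between two accesses to the same location.

    first, second -- "read" or "write", in program order (S1 then S2).
    """
    kinds = {
        ("write", "read"): "flow",
        ("read", "write"): "anti",
        ("write", "write"): "output",
        ("read", "read"): "input",
    }
    return kinds[(first, second)]


print(dependence_type("read", "write"))
```

Input dependences do not constrain execution order, since neither statement modifies the location; they are usually recorded only because they matter for data locality.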



If there are no data dependencies present in a section of code then it may be executed
in a non-deterministic fashion. Loop-carried dependencies are data dependencies that
exist across iterations of a loop. This type of dependence effectively controls when
and where concurrency can be introduced in a loop. A loop can be executed in
parallel without synchronization only if all the dependencies are loop-independent.
The approach taken in automatic parallelisation is to determine if dependencies exist
within a loop and, if possible, remove or reduce them by applying a sequence of
transformations to the loop. There are over 40 well-known loop transformations [9],
many of which re-order statements based on dependence analysis.

For example, consider the code fragment shown in figure 2 containing an
anti-dependence on the variable A. Because of this loop-carried dependence, no two
iterations of the loop can be executed concurrently. Applying the loop
distribution transformation will split the original loop into two separate loops, as
shown in figure 2, in order to break the loop-carried data dependency. This results in
the formation of two independent loops that may each execute concurrently.

Fig. 2. Application of the loop distribution transformation to remove a loop-carried
anti-dependence on the variable A.
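The effect of loop distribution on an anti-dependence can be sketched as follows. This is a hypothetical loop written for illustration, not the code of figure 2: in the original loop, iteration i reads a[i + 1], which iteration i + 1 later overwrites. After distribution, all reads are performed in a first loop and all writes in a second, and the iterations within each loop are independent of one another.

```python
def original_loop(a, b):
    """Sequential loop with a loop-carried anti-dependence on a."""
    a = list(a)
    n = len(b)
    d = [0] * n
    for i in range(n):
        a[i] = b[i] + 1        # S1 writes a[i]
        d[i] = a[i + 1] * 2    # S2 reads a[i+1], written by S1 in iteration i+1
    return a, d


def distributed_loops(a, b):
    """Same computation after loop distribution."""
    a = list(a)
    n = len(b)
    d = [0] * n
    for i in range(n):         # first loop: every read of the old a values
        d[i] = a[i + 1] * 2
    for i in range(n):         # second loop: every write to a
        a[i] = b[i] + 1
    return a, d


a0 = [1, 2, 3, 4, 5]           # one element longer than b
b0 = [10, 20, 30, 40]
print(original_loop(a0, b0) == distributed_loops(a0, b0))
```

The reading loop is placed first because the anti-dependence runs from the read to the later write; reversing the two loops would change the result.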



It is important to note that application of a transformation must result in code that is 
both correct and functionally equivalent to the original. To ensure that this is the case 
legality tests are performed and must be passed before the transformation is permitted. 
Ideally, the transformed code should have fewer dependencies than the original. This 
is not guaranteed and is often not the case. Frequently, application of a transformation 
in an attempt to eliminate a given dependency can result in the introduction of new 
dependencies. 



3.2 Choice of the Neural Paradigm 

A number of network architectures were considered as potential solutions. One 
immediate problem that must be addressed is the nature of the data available to train 
the network. While there is an abundance of code loops available for input data there 
is no corresponding source of output data where suitable transformations, or 
transformation sequences, have been identified for each input code loop. In general, it 
requires an expert to specify the most appropriate transformations for a given input 
code. Casting the problem as one requiring supervised learning (e.g., a multilayer 
perceptron), where both the input derived from the code and the output 
transformations are known, would inevitably result in a network which encapsulated 
the opinion of a given expert. As agreement among experts is rare in the field of 
parallelisation the value of such a trained network would be questionable. 

Selecting the "best" sequence of loop transformations can also be viewed as an
optimisation problem. Here the problem is formulated as one of finding the optimum
sequence of transformations that satisfy the dependence constraints of the code loop
under consideration while minimising execution time (e.g., a Hopfield network or a
Boltzmann machine). However, mapping the dependencies to network weights and
interpreting the eventual solution makes this approach difficult from the
representational point of view. In addition, the neural paradigms available for
optimisation problems suffer from a number of serious disadvantages in the context of
this application. The Boltzmann machine, for example, can take a relatively long time
to reach a solution and is therefore inappropriate as a component in an online system
while the Hopfield network can give unexplained false positive results.

The Kohonen self-organising network was eventually identified as the most
appropriate neural component for the transform selector. The principal reason for this
choice is that the Kohonen network employs an unsupervised learning algorithm in
which it is not necessary to know in advance the 'correct' output for a given input.
Once trained the organised network topology reflects the statistical regularities of the
input data. This information is useful for exploratory data analysis and visualisation of
high dimensional data.



3.3 Data Sources 

The training data used to develop the Kohonen network is derived from a suite of 
standard Fortran-77 benchmarking codes. The sources include the Livermore loops
[10] and Dongarra’s parallel loops [11]. Loops from these sources represent examples 
of real code; an important consideration if the eventual transform selector is to be 
capable of dealing with the intricacies of real input codes. A selection of Banerjee’s 
loops [12] were also included. By comparison with the other sources, these are not 
'real' codes but were included for their rich data dependence information. The 
eventual loop corpus contains 110 loops with a total of almost 16,000 dependencies 
and is considered sufficient to provide representative coverage of the problem space. 



3.4 Code Characterisation 

Code characterisation is a necessary pre-processing stage in which a set of feature 
vectors is generated for each loop in the code set. It is important that the 
characterisation scheme employed should capture information influential in the 
selection of loop transformations, particularly information on the nature of any 
dependencies present in the loop. The characterisation scheme used encodes 12 
features for each loop in a 20 component feature vector. Details of the 
characterisation scheme are shown in Table 2. 

It is essential that the characterisation scheme capture the complexities of real
codes. In particular, aspects of a code loop which prevent dependence analysis, and
hence the selection of an appropriate transform, must be considered. For example, the
characterisation scheme must deal with symbolic dependencies where symbolic terms
occur in loop bounds or array subscript expressions. Such symbolic terms impede
dependence analysis, yet empirical studies have shown that over 50% of array
references and 95% of loop bounds in real programs contain non-linear, symbolic,
expressions [11, 13]. Where precise dependence information is unobtainable the
characterisation scheme encodes one of five possible reasons (Cat in Table 2) as
follows:

- Symbolic term in loop bound

- Symbolic term in subscript expression

- Symbolic term in loop bound and subscript expression

- Dependence has non-constant direction vector

- Linear expression too complex for dependence analysis.

A tool has been developed to automatically generate a set of feature vectors, as
defined in Table 2, from Fortran-77 input code. Each feature vector produced 
corresponds to a single loop-carried dependence. As such, a given loop may give rise 
to a number of feature vectors, one for each dependence. The characterisation scheme 
also provides contextual information through a strength rating representing the 
proportion of each particular dependence type in a loop. This characteristic provides a 
measure of the complexity of the loop in terms of data dependencies. 
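The strength rating can be illustrated with a short sketch. The text does not give the exact normalisation, so the code below assumes that each strength is simply the fraction of the loop's dependencies that are of the given type; the function name and the example counts are hypothetical.

```python
from collections import Counter

# The four dependence types named in Table 2.
DEPENDENCE_TYPES = ("flow", "anti", "output", "input")


def dependence_strengths(dependence_types):
    """Return the proportion of each dependence type in one loop.

    Assumption: strength = count of that type / total number of
    dependencies in the loop. The paper does not state the formula.
    """
    counts = Counter(dependence_types)
    total = sum(counts[t] for t in DEPENDENCE_TYPES)
    if total == 0:
        # A loop without dependencies has zero strength for every type.
        return {t: 0.0 for t in DEPENDENCE_TYPES}
    return {t: counts[t] / total for t in DEPENDENCE_TYPES}


# Hypothetical loop with five flow, two input and one output dependence.
example = ["flow"] * 5 + ["input"] * 2 + ["output"]
strengths = dependence_strengths(example)
```

Under this assumption the four strengths of a loop always sum to one, so a loop dominated by a single dependence type is easy to distinguish from one with a mixture of types.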

While the capability to deal with real codes is essential, it is also necessary to limit
the complexity of the problem in order to limit the dimensionality of the feature
vectors. To this end, all loops in the loop corpus contain only assignment statements in
the loop body, are normalised, and are either singly nested or perfectly doubly nested.



Table 2. Characterisation Scheme

Feature                                             Mnemonic    Components
Type of dependence - flow, anti, output or input    Type        3
Direction vector                                    Direction   2
Category when precise information is unobtainable   Cat         5
Flow dependence strength in loop                    Sf          1
Anti dependence strength in loop                    Sa          1
Output dependence strength in loop                  So          1
Input dependence strength in loop                   Si          1
Lower bound of outer loop                           LBI         1
Upper bound of outer loop                           UBI         1
Lower bound of inner loop                           LBJ         2
Upper bound of inner loop                           UBJ         2



3.5 Network Training 

The feature vectors generated from the Fortran-77 loop corpus are used as input to 
train the network. Since the Kohonen network uses an unsupervised learning 
algorithm it is not necessary to know a priori which transformations are appropriate 
for a given loop. During training the Kohonen layer undergoes a self-organising 
process in which a two-dimensional map is produced representing the higher 
dimensional input space (a schematic of the network is shown in figure 3). An 
essential feature of the map produced is that it preserves the topology of the input 
space in that inputs which are ‘close together’ in input space are mapped to points 
‘close together’ on the Kohonen layer. In effect, points on the Kohonen map represent 
prototypes, or cluster centres, for the feature vectors used during training. Thus, a
feature vector input to the trained network will be represented by a single prototype
on the mapping layer. 
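The self-organising process described above can be sketched in a few lines. This is a minimal illustration of a Kohonen map and not the network used in this work: the grid size, the linear learning-rate and radius schedules, the Gaussian neighbourhood and the names train_som and best_matching_unit are all assumptions made for the example.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid position whose prototype is closest to x."""
    return min(
        weights,
        key=lambda pos: sum((w - xi) ** 2 for w, xi in zip(weights[pos], x)),
    )


def train_som(vectors, rows, cols, epochs=20, lr0=0.5, seed=0):
    """Train a rows x cols map on the given vectors (illustrative schedule)."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    # One prototype (weight vector) per neuron on the 2-D mapping layer.
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    radius0 = max(rows, cols) / 2.0
    total_steps = epochs * len(vectors)
    step = 0
    for _ in range(epochs):
        for x in vectors:
            progress = step / total_steps
            lr = lr0 * (1.0 - progress)                    # decaying learning rate
            radius = radius0 + (1.0 - radius0) * progress  # moves towards 1
            winner = best_matching_unit(weights, x)
            for pos, w in weights.items():
                # Neurons near the winner on the grid are pulled more strongly.
                grid_d2 = (pos[0] - winner[0]) ** 2 + (pos[1] - winner[1]) ** 2
                h = math.exp(-grid_d2 / (2.0 * radius * radius))
                for k in range(dim):
                    w[k] += lr * h * (x[k] - w[k])
            step += 1
    return weights


# Hypothetical three-component feature vectors, scaled to [0, 1].
data = [[0.0, 0.0, 1.0], [0.1, 0.0, 0.9], [1.0, 1.0, 0.0], [0.9, 1.0, 0.1]]
som = train_som(data, rows=4, cols=4)
winners = [best_matching_unit(som, x) for x in data]
```

After training, best_matching_unit plays the role of presenting a feature vector to the network: it returns the grid co-ordinates of the single prototype that represents the input.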




[Figure 3 labels: feature vectors enter the input layer; weighted connections give full connectivity between the input layer and the mapping layer; neurons (prototypes) are arranged as a 2-D grid to form the mapping layer, in which one winning prototype is marked.]

Fig. 3. Schematic of the Kohonen Network. Neurons (prototypes) in the mapping layer are 
spatially arranged as a 2-D grid. Inputs (feature vectors) are projected onto the prototypes in the 
mapping layer such that the topology of the input space is preserved. In this way the mapping 
layer effectively forms a dimensionally reduced ‘map’ of the input space. Further information 
on the Kohonen network and neural networks in general may be found in Haykin [14]. 



It is a fundamental assumption in this work that input codes that map to the same 
prototype on the Kohonen mapping layer, and are therefore ‘close together’ in input 
space, are amenable to the same transformation. The essence of the transformation 
framework then is to establish which transformation(s) are most appropriate for each 
prototype represented in the Kohonen layer. 



3.6 Labelling the Map 

A reverse engineering approach was adopted in labelling each prototype with the 
corresponding transformation(s). Firstly, a number of transformations were chosen, 
namely: 

- D Loop Distribution 

- I Loop Interchange 

- S Loop Skewing 

- E Scalar Expansion and 

- R Statement Reordering 

Although there are over 40 transformations available, it has been shown that relatively 
few are necessary to cope with the majority of dependencies found in real code [15]. 
The chosen transformations are among the most useful and widely used. They are also 
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the transformations supported within the KATT environment and will therefore 
facilitate direct comparison between the existing expert system based transform 
selector and the neural network based tool under development. 

For each chosen transformation a data set of representative codes was established 
where it is known that the code is amenable to the transformation. Each code is then 
characterised and the resultant feature vector presented to the trained network. In each 
case one prototype in the mapping layer is identified which best represents the input 
and that prototype labelled with the associated transformation. The result is a 
transformation framework; a labelled map that may be used to suggest 
transformations that should be applied to an input code in an attempt to reduce or 
remove loop-carried dependencies. 



4 Results 

As an illustrative example of how the transformation framework operates, consider 
the following loop: 

    DO I = 4,200
      A(I) = B(I) + C(I)
      B(I+2) = A(I-1) + A(I-3) + C(I-1)
      A(I+1) = B(2*I+3) + 1
    CONTINUE

This loop is input to the characterisation tool for analysis and generation of feature 
vectors. Analysis reveals a total of ten dependencies; five flow dependencies, two 
input dependencies, one output dependency and two assumed flow dependencies 
where complete analysis was prevented because the linear subscript expressions were 
too complex. As a result of the analysis four feature vectors, X1 to X4, are returned -
one for each loop-carried dependency identified. Each feature vector is presented in 
turn to the Kohonen network and the position of the corresponding prototypes noted. 
These are shown at grid co-ordinates (7,11), (2,8), (4,1) and (5,11) on the labelled 
map in figure 4. 

Notice that the inputs do not all fall directly onto labelled prototypes. Table 3 
shows the distances from inputs to the nearest labelled prototypes. 

The best choice of transformation appears to be loop skewing (S) because it is the 
closest labelled prototype to a number of inputs and should remove most 
dependencies. However, loop skewing is illegal here, i.e., it does not apply to 1D loops.
The next best transformation is loop distribution (D), which is legal. Applying loop 
distribution results in two separate loops with fewer dependencies in each. 

    DO I = 4,200
S1:   A(I) = B(I) + C(I)
S2:   B(I+2) = A(I-1) + A(I-3) + C(I-1)
    CONTINUE

    DO I = 4,200
S3:   A(I+1) = B(2*I+3) + 1
    CONTINUE

A second iteration of the procedure is now performed. After loop distribution the
second loop is parallel, a fact detected by the characterisation tool, resulting in no 
further transformations being performed on this loop. The first loop now has three
dependencies: S1 δf S2 by A, S2 δf S1 by B, and S2 δi S2 with A. Repeating the same
process as before results in two vectors, Y1 and Y2, shown plotted at their
corresponding prototypes in figure 4.

Fig. 4. Labelled map showing inputs X1, X2, X3 and X4 from the first iteration and Y1, Y2 from
the second iteration



Table 3. Distance from input vectors to nearest labelled prototypes

Input   Nearest Prototype(s)    Transform                                 Distance
X1      (6,10)                  Distribution                              1.414
X2      (1,8), (2,7), (3,8)     Skewing                                   1.000
                                Distribution                              1.000
X3      (1,2)                   Interchange or Expansion or Reordering    3.162
X4      (5,11)                  Skewing                                   0.000
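The distances in Table 3 agree with the ordinary Euclidean distance between grid co-ordinates on the map. The metric is not named in the text, so this is an inference; the short check below recomputes the distances from the input positions given above and the nearest prototypes listed in the table.

```python
import math

# Grid positions of the four inputs, as given in the text.
inputs = {"X1": (7, 11), "X2": (2, 8), "X3": (4, 1), "X4": (5, 11)}

# Nearest labelled prototypes, as listed in Table 3.
nearest = {
    "X1": [(6, 10)],
    "X2": [(1, 8), (2, 7), (3, 8)],
    "X3": [(1, 2)],
    "X4": [(5, 11)],
}


def grid_distance(a, b):
    """Euclidean distance between two grid positions (assumed metric)."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


distances = {
    name: [round(grid_distance(pos, p), 3) for p in nearest[name]]
    for name, pos in inputs.items()
}
```

The recomputed values are 1.414 for X1, 1.000 for each prototype near X2, 3.162 for X3 and 0.000 for X4, matching the table.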



It can be seen from the map that Y1 is closest to a prototype representing loop
distribution (D). Input Y2 is nearest loop skewing (S) but very near two prototypes
also representing loop distribution (D). Therefore, loop distribution is the
recommended transformation. Application of this transformation gives the following
loops: 

    DO I = 4,200
S1:   A(I) = B(I) + C(I)
    CONTINUE

    DO I = 4,200
S2:   B(I+2) = A(I-1) + A(I-3) + C(I-1)
    CONTINUE

Both these loops may now be executed in parallel; hence there is no requirement for 
further transformation. 

This example illustrates an obvious case of when to stop applying
transformations, i.e., when all dependencies have been removed and the transformed
loops can be executed concurrently. In other cases, determining when to stop
iteratively applying the process is not so straightforward. The following stopping
criteria are used, in order of preference:

• all the dependencies are removed, 

• the user decides to apply no further transformations, 

• applying the recommended transformation does not reduce the number of 
dependencies - in this case the user may decide to proceed with the code after 
transformation or roll back to a previous state, and 

• the distance on the map between inputs and labelled prototypes is large - typically 
for any distance greater than 3 the network cannot confidently make a 
recommendation. 
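The order of preference above can be expressed as a small decision function. This is a sketch only: the function name, its arguments and the returned strings are hypothetical, and the default threshold of 3 is the typical value quoted in the last criterion.

```python
def stopping_reason(deps_before, deps_after, user_continues,
                    nearest_distance, max_distance=3.0):
    """Return the first stopping criterion that applies, or None to continue.

    deps_before / deps_after: dependency counts before and after the last
    recommended transformation was applied.
    user_continues: False if the user chooses to apply nothing further.
    nearest_distance: map distance from the input to the nearest labelled
    prototype for the next recommendation.
    """
    if deps_after == 0:
        return "all dependencies removed"
    if not user_continues:
        return "user stopped"
    if deps_after >= deps_before:
        # The user may keep the transformed code or roll back.
        return "no reduction in dependencies"
    if nearest_distance > max_distance:
        return "no confident recommendation"
    return None
```

For instance, an input that lies 3.162 grid units from the nearest labelled prototype, like X3 in Table 3, exceeds the default threshold, so the function reports that no confident recommendation can be made.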

The transformation framework presented here has been successful in recommending 
appropriate transformations for over 90% of the codes on which it has been tested. 
The recommendations made by the system have also been compared with those 
offered by the existing expert system within the KATT environment. Again, these 
results have been encouraging. The quality of the advice offered by both systems is
comparable in terms of reducing dependencies and increasing concurrency; however,
the transformation sequences suggested often differ. This is not surprising since, as
argued in section 2, the systems complement one another in that they each exploit 
different aspects of the underlying parallelisation knowledge model. In addition, the 
expert system has access to information on the target architecture and its 
recommendations are therefore informed by that knowledge. The neural network tool 
on the other hand derives its knowledge solely from the code and its 
recommendations are entirely independent of the target architecture. 



5 Conclusions 

A hitherto untried neural network based technique offering strategic guidance on the 
selection of appropriate code transformation sequences has been developed and 
implemented. The transform selector developed has been tested and has produced 
excellent results. Research is continuing into the possibility and desirability of
extending the existing tool. Extensions may include using more than the five
transformations originally chosen or labelling more prototypes by using more 
labelling codes. 

The next stage of the work is to integrate the neural-based transform selector and 
the existing expert system to form a hybrid transformation framework within the 
KATT environment. Such a hybrid would effectively combine the different views of 
transformation selection currently offered individually, thereby permitting a more 
comprehensive exploitation of the available parallelisation knowledge. An essential 
part of the integration will involve extending the expert system with the new 
knowledge gained as a result of analysing the Kohonen network at the heart of the 
new transformation framework.



Abstract. Parallel programming demands, in contrast to sequential
programming, sub-task identification, dependence analysis and task-to-processor
assignment. This paper presents a new parallelising tool that
supports the programmer in these early challenging phases of parallel
programming. In contrast to existing parallelising tools, the proposed
parallelising tool is based on a new graph theoretic model, called the annotated
hierarchical graph, that integrates the commonly used graph theoretic
models for parallel computing. Part of the parallelising tool is an
object oriented framework for the development of scheduling and mapping
algorithms, which allows new algorithms to be adapted and implemented
rapidly. The tool achieves platform independence by relying on internal
structures that are not bound to any architecture and by being
implemented in Java.



1 Introduction 

The programming of a parallel system for its efficient utilisation is complex 
and time consuming, despite the many years of research in this area. Parallel 
programming demands much more than sequential programming, since (i) sub- 
tasks which can be executed in parallel must be identified, (ii) the dependence 
between the sub-tasks has to be analysed and (iii) the tasks have to be mapped 
and scheduled to a target system. 

Due to the high complexity of parallel programming, research on methods,
mechanisms and tools to support the parallelisation of tasks has been encouraged.
Many tools, languages and libraries have emerged from this research. Even
though the majority of them only support coding and result analysis, there are
some tools, in this paper called parallelising tools, which address the challenging
early phases of the development of a parallel program (e.g. CASCH [1], Task
Grapher [2], PYRROS [3], CODE [4], Meander [5], or [6]).

Parallelising tools commonly use graph theoretic models for the representation
of a program to be parallelised. Computation is associated with the vertices,
and communication with the edges of the graph. The graph is generated from
an initial description of the program to be parallelised (for example using a
proprietary task graph language [3]) or it is interactively constructed in the
environment of a parallelising tool [4]. The Directed Acyclic Graph (DAG) [7] is
the most common model employed, and a lot of recent research has been performed
in the area of mapping and scheduling algorithms (e.g. [8, 9]) for this
graph model. A shortcoming of this graph model is, however, its inability
to model code cycles explicitly. Loops with a large or even, at compile time,
unknown number of iterations cannot be represented with a DAG. Other graph
models, like the Temporal Communication Graph (TCG) [10] or the Iterative
Task Graph (ITG) [11], overcome this limitation.

The majority of parallelising tools are based on only one graph model,
which is mostly the DAG (e.g. [1, 2, 3]), and they often use only a limited
number of scheduling and mapping algorithms. This further limits their area
of application, since the scheduling and mapping heuristics often have affinity 
for certain types of applications and parallel architectures. In other words, a 
scheduling heuristic may produce good results only for some types of programs 
on certain architectures. An extensible and adaptable design of the parallelising 
tool is necessary to include algorithms tailored for different types of applications 
and parallel architectures. 

A myriad of scheduling heuristics for the DAG model [12, 13, 14, 15] has been 
developed by the parallel computing community. Some of these algorithms take 
a realistic hardware architecture into account [2, 16, 17]. For a parallelising tool 
to benefit from the programmer's knowledge about the target system, it must
provide a scheme for the specification of these characteristics in a way that the 
appropriate algorithms can use them. 

A parallelising tool can benefit from the natural strength of humans, such 
as information abstraction, pattern matching and problem decomposition, when 
it leaves some of the parallelisation decisions to the user. These decisions can 
be, for example, the choice of a scheduling algorithm, the characterisation of the 
target machine or the manipulation of the parallel program structure. A visual 
environment, which displays the parallel structure (in two or three dimensions) 
and allows interactive manipulations [4, 6] gives the user the most flexibility. In 
such an environment, the programmer can understand, correct and optimise the 
parallel structure. 

This paper proposes a parallelising tool that addresses the issues discussed
above. The parallelising tool is based on a new hierarchical graph model that
integrates multiple graph theoretic models, to support a wide range of applications
and parallel architectures. Its design allows the adaptation and modular
extension of parallelising algorithms, by integrating new algorithms for scheduling,
mapping and structure manipulation. A visual 3D environment displays the
parallel structure of the program, to support the parallelisation decisions of the
programmer. For platform independence, the parallelising tool is implemented
in Java.

The rest of this paper is organised as follows: Section 2 presents an overview
of the parallelising tool and introduces its principal components. After the classification
and comparison of graph theoretic models, a new graph model called the
annotated hierarchical graph is presented in Section 3. The subsequent sections
then discuss the components of the tool. Section 4 presents the type of inputs
accepted by the tool, Section 5 describes the interactive visual environment,
Section 6 analyses the module block that provides the algorithms, and Section
7 provides information about the output of the parallelising tool. We finally
conclude this article in Section 8.

2 The Proposed Parallelising Tool 

The overall structure of the parallelising tool we are currently developing is 
shown in figure 1. As indicated by the shaded areas in the figure, the tool is 
divided into four main parts. The input part serves for the initial description
of the task to be parallelised. Modules within this part generate an annotated
hierarchical graph from a task defined, for example, in a sequential programming
language (e.g. C) or as an equation (e.g. in syntax). The annotated
hierarchical graph is from then on the representation of the task to be parallelised
and forms the core of the parallelising tool. In the central part, this graph is
visualised in a 3D graphical environment. This environment allows the analysis 
and manipulation of the task’s parallel structure and, in addition, the possibility 
to interactively construct an annotated hierarchical graph in a direct way. To 
provide the algorithms and methods for parallelisation, such as structure manipulation,
scheduling and mapping, the central part interfaces to a module block
containing a pool of algorithms. Taking the annotated hierarchical graph as input,
these algorithms can be executed by the user for mapping and scheduling a
task onto a target machine. Algorithms that take the target machine's characteristics
into account receive this information as an additional input, provided by
the user. The mapping and the schedule of the task are the principal output of the
tool. A programmer can then use traditional tools to code the task, guided by 
the obtained schedule and mapping. A long term objective in the development 
of the tool is to employ code generators which automatically build the program. 



3 Annotated Hierarchical Graph 

The annotated hierarchical graph forms the core of the proposed parallelising 
tool. Unlike other parallelising tools, our tool is not based on only one graph 
theoretic model. It rather tries to integrate various graph models and thus to 
benefit from their combined advantages. Before the annotated hierarchical graph 
is described, we, therefore, analyse and classify the utilised graph theoretic mod- 
els. 

3.1 Graph Theoretic Models 

In a graph theoretic abstraction, a parallel program consists of two activities:
computation and communication. Computations or tasks are associated with the
nodes and communications with the edges of the graph. All instructions of one
task are executed in sequential order, i.e. there is no parallelism within one task.

Fig. 1. Overview of the parallelising tool

In [18] a classification scheme for graph theoretic models was proposed. In
the context of a parallelising tool, a graph model is distinguished according to
(i) the parallel computations that can be modelled, (ii) the supported parallel
architectures and (iii) the available algorithms.

Within these three classification groups the models are analysed according
to the following criteria (a more detailed description of the classification
scheme is given in [18, 19]):

1. Parallel computations

- Granularity: fine grained, medium grained and coarse grained

- Iterative computations: how are iterative computations modelled?

- Regularity: can the model represent regularity explicitly?

2. Parallel architectures

- Data and instruction stream: SIMD and MIMD streams

- Memory architecture: shared memory (UMA), distributed memory and
shared distributed (NUMA) memory

- VLSI systems: VLSI array processors with synchronous data and control
flow

3. Proposed algorithms

- Dependence analysis and exploitation of parallelism

- Task-to-processor mapping

- Scheduling

To choose a graph theoretic model for the parallelising tool, we have analysed
and compared the common graph theoretic models using the above classification
scheme [19]. From the many existing models, the DAG, ITG and TCG turned
out to be of interest for the parallelising tool. In the following sections these
three models are briefly discussed and the aspects which influenced the
annotated hierarchical graph structure are pointed out.



Directed Acyclic Graph (DAG) The designation DAG [7] merely reflects the 
graph theoretic nature of this graph model, which is consequently directed and 
acyclic. Interesting for a parallelising tool are node and edge weighted DAGs, 
as they well reflect parallel computations with non-uniform computation and 
communication costs. 

The acyclic property of the DAG imposes a restriction on how parallel computations
are modelled. Iterative computations, which build a cyclic structure,
have to be modelled in a restricted form. A coarse-grained approach consists
in projecting the iterative part of a parallel computation onto one task. On the
other hand, in a fine-grained approach only the tasks of one iteration are modelled,
without taking into account the inter-iteration dependence. For a complete
fine-grained representation, the iterative computation may be 'unrolled', where
each iteration is represented by its own sub-graph, and these sub-graphs are
connected according to the inter-iteration dependence.

In the last approach, the size of the DAG increases linearly with the number 
of iterations. In practice, this representation may generally be used only for small 
numbers of iterations. Also, the number of iterations has to be known at compile 
time; iterative computations whose iteration number is only known at runtime 
cannot be modelled with this approach. 
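The unrolled representation and its linear growth can be made concrete with a small sketch. The edge format (source, destination, iteration distance), the function name and the three-task loop body are assumptions chosen for the example; they are not taken from any of the cited tools.

```python
def unroll(body_nodes, edges, iterations):
    """Unroll an iterative computation into a DAG.

    edges holds (src, dst, distance) triples: distance 0 is a dependence
    within one iteration, distance d > 0 links iteration i to iteration i + d.
    Each DAG node is a (task name, iteration index) pair.
    """
    nodes = [(name, i) for i in range(iterations) for name in body_nodes]
    dag_edges = []
    for i in range(iterations):
        for src, dst, distance in edges:
            # Dependences that would leave the iteration range are dropped.
            if i + distance < iterations:
                dag_edges.append(((src, i), (dst, i + distance)))
    return nodes, dag_edges


# Hypothetical loop body: a -> b -> c, and c feeds a in the next iteration.
body = ["a", "b", "c"]
edges = [("a", "b", 0), ("b", "c", 0), ("c", "a", 1)]

# Number of DAG nodes for 1, 10 and 100 iterations.
sizes = {n: len(unroll(body, edges, n)[0]) for n in (1, 10, 100)}
```

The node count is the body size multiplied by the number of iterations, which is why the approach needs the iteration count at compile time and becomes impractical when that count is large.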

The typical granularity of modelled computations is coarse-grained, as the 
alternative designations as task graph and macro-dataflow graph indicate. The 
DAG is usually employed for mapping and scheduling on distributed-memory 
architectures. Node and edge weighted DAGs are only used for MIMD streams, 
since no spatial regularity is exploited by the DAG model, which is necessary to 
support SIMD streams. 

A myriad of mapping and scheduling algorithms [12, 13, 14, 15] were proposed 
based on node and edge weighted DAGs, whose directed and acyclic properties 
allow efficient scheduling algorithms. 

Iterative Task Graph (ITG) The ITG [11] belongs to the group of data
flow graphs, which model the flow of data or signals in a computation. Data
flow graphs are directed graphs, but in contrast to the DAG model, data flow
graphs can incorporate cycles and thus allow a more compact representation of
computations. For a data flow graph to represent a valid computation, cycles
must include at least one delay [20]. A delay is represented by a weight associated
with an edge. It may be expressed as a multiple of a time unit (often
denoted D) or the number of iterations the communication between two nodes
is to be delayed. A delay “breaks” the precedence-constraint cycle and thus
allows a computable representation of iterative computations. Owing to the delays,
data flow graphs can model intra-iteration (without delays) and inter-iteration
(with delays) dependence. This efficient scheme for representing iterative computations
substantially reduces the number of nodes in a graph compared to an
unrolled DAG and allows nondeterministic numbers of iterations. Moreover,
iterations are represented in a more intuitive and structured way. Since iterative
computations can be modelled in detail, the granularity is typically fine-grained
to medium-grained.
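The validity condition, that every cycle must include at least one delay, is equivalent to requiring that the sub-graph formed by the zero-delay edges alone is acyclic. A minimal sketch of such a check follows; the edge format (source, destination, delay) and the function name are assumptions made for the illustration.

```python
from collections import defaultdict, deque


def has_delay_on_every_cycle(nodes, edges):
    """Return True if every cycle contains at least one delayed edge.

    edges holds (src, dst, delay) triples. The check removes all edges
    with delay > 0 and tests the rest for cycles with Kahn's algorithm.
    """
    successors = defaultdict(list)
    indegree = {n: 0 for n in nodes}
    for src, dst, delay in edges:
        if delay == 0:
            successors[src].append(dst)
            indegree[dst] += 1
    queue = deque(n for n in nodes if indegree[n] == 0)
    visited = 0
    while queue:
        n = queue.popleft()
        visited += 1
        for m in successors[n]:
            indegree[m] -= 1
            if indegree[m] == 0:
                queue.append(m)
    # All nodes are reached exactly when the zero-delay sub-graph is acyclic.
    return visited == len(nodes)


tasks = ["a", "b", "c"]
valid = [("a", "b", 0), ("b", "c", 0), ("c", "a", 1)]    # cycle broken by a delay
invalid = [("a", "b", 0), ("b", "c", 0), ("c", "a", 0)]  # cycle without a delay
```

Here has_delay_on_every_cycle(tasks, valid) holds, while the same graph without the delay on the edge from c to a is rejected.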

The Iterative Task Graph also contains edge and node weights to reflect the 
computation and communication costs (figure 2a), which are mainly used for 
exploiting parallelism, for mapping and for scheduling. Unfolding, re-timing or 
software pipelining are some examples of the transformations applied to parallel 
computations represented by this model. The ITG is appropriate for compu- 
tations with arbitrary costs on (shared) distributed memory architectures and 
VLSI systems, given that it explicitly provides computation and communication 
costs. As no regularity is included in the ITG, except for iterative computations, 
MIMD streams are the typical data and instruction streams supported by the 
model. For SIMD streams, the ITG lacks spatial regularity. 




Fig. 2. The Iterative Task Graph (a) and the Temporal Communication Graph (b).
Legend: ci/cj - computation cost; wi - communication cost; iD - delay; pi - process



Temporal Communication Graph (TCG) The Temporal Communication
Graph [10] is based on the space-time diagram introduced by Lamport [21].
The TCG is a directed and acyclic graph that is process and phase oriented. A
computation is divided into sequential processes p1, p2, ..., pn and every node
of the graph is associated with exactly one process. A node, associated with
process pi, has at least one edge pointing to its direct successor on process pi
(intra-process dependence) and may also have a communication edge to a node
of another process pj (inter-process communication). A node weight reflects the
computation costs and a weight associated with an inter-process edge represents
the communication costs. Communication between nodes of the same process is
considered to be negligible. Figure 2b illustrates a TCG with three processes. As
the graph built by the nodes and edges is a directed and acyclic graph it may be
considered a node and edge weighted DAG with zero communication costs for
intra-process edges.

A TCG can be described with the aid of the LaRCS [10] graph description
language, which allows phases of computation and communication to be specified.
These phases may exploit spatial and temporal regularity, for example a loop is 
a phase of temporal regularity. On the one hand, this process and phase oriented 
perspective of parallel computations limits the flexibility of the model. On the 
other hand, the TCG draws its power from this scheme, as iterative and other 
regular computations are described in an efficient way. The process-oriented view 
is, moreover, intuitive to many programmers for describing computations. 

In the OREGAMI tool [22] the TCG is used for mapping and scheduling. 
The algorithms employed use the regular structure of the TCG. Furthermore, 
mapping and scheduling algorithms based on the Task Interaction Graph (TIG, 
an undirected graph model [19]) and on the DAG model may also be used. 

The TCG, considered as a DAG, is subject to the same limitations in modelling an
iterative computation as the DAG model itself. However, the TCG can be treated
in parts due to its description in phases. In the OREGAMI tool, for example, 
successive portions of the graph are generated for mapping and scheduling as 
needed. As a result, the complexity and the memory requirement are reduced 
for mapping and scheduling. 



Relations Between the Graph Models Apart from the characteristics dis- 
cussed above, there exist relations between the graph models, some of which were 
already mentioned during the above discussion. These relations can be used to 
transform one graph representation into another, for the purpose of applying an 
algorithm only available for one graph representation or to yield a more com- 
pact representation. In figure 3 the relations of the discussed graph models are 
depicted. Three types of transforms between graph models can be defined: reduction,
projection and unrolling. By reduction of graph properties, a complex
graph model can be transformed into a simpler model, whose properties form a
subset of the complex model's properties. With projection, a graph model can
be transformed into another model with a more compact representation of the
parallel computation. The reverse of projection is unrolling. The ITG,
for example, is unrolled to a DAG by constructing a sub-graph for every iteration 
and connecting these sub-graphs according to their inter-iteration dependence. 
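
The unrolling step can be illustrated with a minimal Java sketch. It assumes that an ITG edge carries an integer delay counting iterations, so that an edge from task u to task v with delay d connects instance (u, i) to instance (v, i + d). The class name, the method name and the array encoding of edges are hypothetical and chosen only for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: unrolls an iterative task graph (ITG) into a DAG. */
class ItgUnrollSketch {

    /**
     * Each ITG edge is {fromTask, toTask, delay}; the delay counts iterations.
     * Each returned DAG edge is {fromTask, fromIteration, toTask, toIteration}.
     */
    static List<int[]> unroll(List<int[]> itgEdges, int iterations) {
        List<int[]> dagEdges = new ArrayList<>();
        for (int i = 0; i < iterations; i++) {
            for (int[] e : itgEdges) {
                int targetIteration = i + e[2];
                // Dependences that point past the last iteration are dropped.
                if (targetIteration < iterations) {
                    dagEdges.add(new int[] {e[0], i, e[1], targetIteration});
                }
            }
        }
        return dagEdges;
    }

    public static void main(String[] args) {
        // Task 0 feeds task 1 in the same iteration; task 1 feeds task 0
        // of the next iteration (delay 1).
        List<int[]> itg = Arrays.asList(new int[] {0, 1, 0}, new int[] {1, 0, 1});
        for (int[] e : unroll(itg, 3)) {
            System.out.println("(" + e[0] + "," + e[1] + ") -> (" + e[2] + "," + e[3] + ")");
        }
    }
}
```

The result is acyclic as long as every cycle of the ITG contains at least one edge with a positive delay, because every such edge then points to a later iteration.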

3.2 Structure of the Annotated Hierarchical Graph 

A conclusion drawn from the comparison of the graph theoretic models is that 
none of the graph models is universal. In other words, none of the models is 
capable of representing every type of application. DAGs are mainly used for a 
coarse grained representation, ITGs can only model iterative computations and 
TCGs are limited by their process and phase approach.

Fig. 3. Relations between graph models

Inspired by the TCG, we developed a hierarchical model that can represent
coarse grained as well as iterative computations, but which is not limited by
a process-phase orientation. The principal idea is that a node of a graph can
itself be a graph, as shown in figure 4a. In this example, the directed main
graph consists of the nodes n1, n2, n3, of which node n3 is itself a directed graph
with the nodes n31, n32, n33. The subgraph is cyclic and represents an iterative
computation, whose dependence cycle is broken by the delay D on the edge
between node n33 and n31.

Formally, a hierarchical graph G is a pair (V, E), where V is a finite set of
vertices connected by a finite set of edges E. An element e = (u, v) of E denotes
an edge between the vertices u, v ∈ V. An edge (u, v) denotes an edge from u
to v and, therefore, (u, v) ≠ (v, u). Note that loops and self loops are possible.
A vertex u ∈ V can itself be a hierarchical graph G_u, where the edges entering
vertex u, (v, u) ∈ E, v ∈ V, enter the source vertices of G_u and the edges leaving
vertex u, (u, v) ∈ E, v ∈ V, leave the sink vertices of G_u. Source vertices are
those vertices which, after removing all edges with delay, have no entering edge
and sink vertices are those which, after removing all edges with delay, have no
leaving edge. In the example of figure 4a node n31 is a source vertex and node
n33 is a sink vertex.
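
As an illustration of this definition, the following Java sketch stores a hierarchical graph and derives its source and sink vertices by ignoring delayed edges. It is not the class design of the tool: all names are hypothetical, and the two undelayed edges inside the sub-graph of n3 are assumed, since the text only specifies the delayed edge from n33 to n31.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of a hierarchical graph with delayed edges. */
class HierarchicalGraphSketch {

    static final class Edge {
        final String from;
        final String to;
        final boolean delayed;

        Edge(String from, String to, boolean delayed) {
            this.from = from;
            this.to = to;
            this.delayed = delayed;
        }
    }

    final Set<String> vertices = new LinkedHashSet<>();
    final List<Edge> edges = new ArrayList<>();
    /** A vertex may itself be a hierarchical graph. */
    final Map<String, HierarchicalGraphSketch> subGraphs = new LinkedHashMap<>();

    void addEdge(String from, String to, boolean delayed) {
        vertices.add(from);
        vertices.add(to);
        edges.add(new Edge(from, to, delayed));
    }

    /** Vertices with no entering edge once delayed edges are removed. */
    Set<String> sources() {
        Set<String> result = new LinkedHashSet<>(vertices);
        for (Edge e : edges) {
            if (!e.delayed) result.remove(e.to);
        }
        return result;
    }

    /** Vertices with no leaving edge once delayed edges are removed. */
    Set<String> sinks() {
        Set<String> result = new LinkedHashSet<>(vertices);
        for (Edge e : edges) {
            if (!e.delayed) result.remove(e.from);
        }
        return result;
    }

    /** Sub-graph of n3; the two undelayed edges are an assumption. */
    static HierarchicalGraphSketch exampleSubGraph() {
        HierarchicalGraphSketch g = new HierarchicalGraphSketch();
        g.addEdge("n31", "n32", false);
        g.addEdge("n32", "n33", false);
        g.addEdge("n33", "n31", true); // the delay D breaks the cycle
        return g;
    }

    public static void main(String[] args) {
        HierarchicalGraphSketch top = new HierarchicalGraphSketch();
        top.vertices.addAll(Arrays.asList("n1", "n2", "n3"));
        top.subGraphs.put("n3", exampleSubGraph());
        System.out.println("sources of n3: " + top.subGraphs.get("n3").sources());
        System.out.println("sinks of n3:   " + top.subGraphs.get("n3").sinks());
    }
}
```

Under the assumed edges, the sketch reports n31 as the only source vertex and n33 as the only sink vertex, in agreement with the example of figure 4a.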

The hierarchical graph can be made a simple directed graph by substituting 
the nodes that are themselves graphs with their respective graphs. This is shown 
for the example of figure 4a in figure 4b, where node n3 was substituted by its
graph. 

The hierarchical graph model permits iterative and non-iterative computations
to be represented in one graph model in a compact form. Therefore, this graph
integrates the DAG and the ITG into one model without being limited by a
process and phase orientation like the TCG. Depending on the purpose, the
graph can be interpreted in various ways. An algorithm may consider only the
coarse grained task graph (only considering the highest hierarchical level) or the
sub-graphs can be treated separately. It is also possible to expand the graph to
a flat directed (cyclic) graph as done with the example from figure 4a in figure
4b. The latter can further be unrolled (provided that the number of iterations
is known for cyclic parts) to a DAG. The hierarchical graph model provides the
flexibility to represent a wide range of applications and, still, all the algorithms
proposed for the three graph models discussed above can be employed.





Fig. 4. Hierarchical graph (a) and expanded hierarchical graph (b) 



Annotation. Associated with every node of the hierarchical graph is an annotation.
This annotation is in textual form and represents the computation executed
by the node. C and VHDL code are two examples of textual annotations. The
annotation may serve for the estimation of the computation time or it may be
used for automatic code generation at the output of the parallelising tool. Moreover,
the textual task description allows the execution costs of the task to be
estimated depending on the architecture of the target machine.

The communication on the edges is also described by an annotation, which
represents the data structure transmitted on the edge and the amount of data.
Again, this can be used to estimate the communication costs or for the code
generation at the output of the tool. For example, a send() and a receive()
command may be inserted, with the respective data structure as parameter, in
the code of the source and sink node of the edge, respectively [1]. With the
knowledge of the amount of data transmitted on the edge, the communication
costs can be estimated according to “startup cost + transmission speed × amount
of data” [11], considering the target machine’s characteristics.
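
The cost estimate quoted above can be written as a small Java function. This is only a sketch: the class and parameter names are hypothetical, the numbers in the example are made up, and "transmission speed" is interpreted here as the time needed per unit of data, which is an assumption about the intended meaning.

```java
/** Hypothetical sketch of the linear communication cost estimate. */
class CommCostSketch {

    /**
     * startupCost: fixed cost of initiating a transfer (time units).
     * timePerUnit: assumed meaning of "transmission speed" (time per data unit).
     * amount:      amount of data annotated on the edge (data units).
     */
    static double communicationCost(double startupCost, double timePerUnit, double amount) {
        return startupCost + timePerUnit * amount;
    }

    public static void main(String[] args) {
        // Made-up target machine: 50 time units startup, 0.5 time units per data unit.
        System.out.println(communicationCost(50.0, 0.5, 1000.0));
    }
}
```

Because the machine parameters are passed in at the time of the estimate, the graph annotation itself only needs to carry the amount of data.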

The interpretation of the annotations is left to the various algorithms and,
thus, the annotated graph structure is not linked to a certain form of task or
communication representation. Also, the graph representation of a program is
platform independent, as costs are estimated only when needed. Note that a
textual description of a task or communication is, of course, not obligatory, and
the user can also provide estimated costs.

4 Program Input 

A program to be parallelised must be initially described in an adequate form to 
be analysed in a parallelising tool. This description is crucial for the exploita- 
tion of parallelism. Various approaches for this initial description are used in 
parallelising tools. Parallelising compilers commonly use an augmented sequential
programming language (e.g. HPF, pC++, Split-C) and concentrate on the
parallelisation of (cost intensive) loops [23]. The CASCH parallelising tool [1]
takes as input a program with procedure calls and creates a node for each procedure
in a DAG representing the program. A proprietary task graph language is
used by PYRROS [3] for the construction of the DAG. The structure of the DAG
is constructed interactively in CODE's environment [4]. In [6] a set of algebraic
equations is entered in an interactive editor, from which a dependence graph is
generated.

The core of our parallelising tool is the annotated hierarchical graph. Consequently,
any initial input from which such a graph can be generated is feasible.
Therefore, the proposed parallelising tool is not limited to any particular form
of initial description. Rather, the input is modularised to allow different initial
descriptions. This is all the more important as the tool supports coarse grained as
well as iterative computations. The parallelising tools referenced above each use one
initial description, depending on the type of computation they support.

As shown in figure 1, we currently implement two types of initial task description.
One is a description as simple algebraic equations. From certain equations
found in signal processing, for example recurrence equations, it is straightforward
to generate a dependence graph [20]. In contrast to [6], an equation is,
however, specified in textual form and not interactively entered. The equation
is then parsed and analysed and a hierarchical graph is generated.

The other type of description that generates iterative structures is the specification
of loops with a simple subset of the C language. For the analysis of the
code, techniques found in automatic program parallelisation are employed [23].
The annotations of the graph's nodes consist of the code parts found in the initial
description of the task.

Coarse grained computations can be specified as an interactively constructed
graph in the visual environment which is presented in the next Section. 



5 Visual Environment 

The central part of the parallelising tool is a visual environment (figure 1). Here, 
the graphs generated from the initial descriptions are visualised. When fully
implemented, the graph will be displayed in three dimensions and the programmer
will be able to change the viewpoint, edit annotations and manipulate the structure of
the graph. Moreover, a fully annotated hierarchical graph can be interactively
constructed in this visual environment. It is also possible to construct the coarse 
grained structure of a program and to use the other input types to generate finer 
grained iterative computations that can be integrated in the overall structure. We 
are currently integrating a first experimental environment into the parallelising 
tool. Figure 5 shows the graph of a localised matrix-matrix multiplication (N=4) 
visualised in the experimental environment. 

Fig. 5. Visualised graph for a matrix-matrix multiplication (N=4)

With the interactive environment, the parallelising tool can benefit from the
natural strengths of humans, such as information abstraction, pattern matching
and problem decomposition. The programmer can take decisions which could
not, or only unsatisfactorily, be taken automatically. These decisions can be, for
example, the choice of a scheduling algorithm or the manipulation of the parallel
program structure. In such an environment, the programmer can understand,
correct and optimise the parallel structure.

The scheduling, mapping and structure manipulation algorithms, which can 
be applied to a graph, can be chosen from a set of algorithms provided by the 
module block, to which the visual environment interfaces. 



6 Module Block 

The module block is responsible for providing the parallelising algorithms for our
parallelising tool. It is conceived to provide a wide range of algorithms and to be 
adaptable and extensible for new algorithms. To achieve this goal, we developed 
an object oriented framework for scheduling and mapping algorithms. 

The framework was implemented in Java and we employed the Collection 
Framework of Java 2 for the basic data structures. It is composed of the following 
packages: 

— graph - This package comprises a class hierarchy for a general graph frame- 
work. At the top of the hierarchy is a multi-graph which permits directed 
or undirected edges, cycles and parallel edges and the hierarchy goes down 
to trees and directed acyclic graphs. In conjunction with these classes, basic 
graph algorithms like BFS, DFS, topological order or connected components 
are provided. 

— hierarchicalgraph - This package contains the classes for the annotated 
hierarchical graph. It is based on the graph package and provides the elements
for the hierarchical structure and the annotations.

— schedule - In the schedule package, the classes to represent mappings and 
schedules of programs represented by graphs are provided. Methods for the 
visualisation of a schedule as a Gantt chart are also included. 

— architecture - This package is used for the definition of the target ma- 
chine’s architecture. It is also based on the graph package, as the architecture 
of a parallel machine is described as a graph. 

The hierarchical graph class provides basic functions that are commonly used by
scheduling and mapping algorithms. Examples are the calculation of the bottom 
or top level of a node, the critical path, or the unrolling of cycles. These functions 
reduce the effort for a programmer to implement a new scheduling algorithm. 
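
To make these helper functions concrete, the following Java sketch computes bottom levels, top levels and the critical path length of a small node and edge weighted DAG. It uses the common definitions (bottom level: longest path from a node to an exit node including the node's own cost; top level: longest path from an entry node to the node excluding its own cost). The class, its methods and the example graph are hypothetical and do not reproduce the framework's API; the sketch also assumes that nodes are numbered in topological order.

```java
/** Hypothetical sketch: node levels in a node and edge weighted DAG. */
class DagLevelSketch {

    // Assumption: nodes are numbered in topological order, so from < to
    // holds for every edge {from, to}; edgeCost[k] belongs to edges[k].

    /** Longest path from each node to an exit node, including the node itself. */
    static double[] bottomLevels(double[] nodeCost, int[][] edges, double[] edgeCost) {
        double[] bl = nodeCost.clone();
        for (int v = nodeCost.length - 1; v >= 0; v--) {
            for (int k = 0; k < edges.length; k++) {
                if (edges[k][0] == v) {
                    bl[v] = Math.max(bl[v], nodeCost[v] + edgeCost[k] + bl[edges[k][1]]);
                }
            }
        }
        return bl;
    }

    /** Longest path from an entry node to each node, excluding the node itself. */
    static double[] topLevels(double[] nodeCost, int[][] edges, double[] edgeCost) {
        double[] tl = new double[nodeCost.length];
        for (int v = 0; v < nodeCost.length; v++) {
            for (int k = 0; k < edges.length; k++) {
                if (edges[k][1] == v) {
                    int u = edges[k][0];
                    tl[v] = Math.max(tl[v], tl[u] + nodeCost[u] + edgeCost[k]);
                }
            }
        }
        return tl;
    }

    /** The critical path length is the largest sum of top and bottom level. */
    static double criticalPathLength(double[] tl, double[] bl) {
        double best = 0.0;
        for (int v = 0; v < tl.length; v++) {
            best = Math.max(best, tl[v] + bl[v]);
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up diamond-shaped DAG: 0 -> 1 -> 3 and 0 -> 2 -> 3.
        double[] nodeCost = {2, 3, 4, 1};
        int[][] edges = {{0, 1}, {0, 2}, {1, 3}, {2, 3}};
        double[] edgeCost = {1, 2, 1, 1};
        double[] bl = bottomLevels(nodeCost, edges, edgeCost);
        double[] tl = topLevels(nodeCost, edges, edgeCost);
        System.out.println("bottom level of node 0: " + bl[0]);
        System.out.println("top level of node 3:    " + tl[3]);
        System.out.println("critical path length:   " + criticalPathLength(tl, bl));
    }
}
```

List scheduling heuristics typically use such levels as node priorities, which is why providing them in the graph class saves work for each new algorithm.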

As the annotated hierarchical graph is based on multiple graph models, algorithms
based on different models can be employed in the parallelising tool.
Most scheduling and mapping algorithms were proposed for the DAG model
[12, 13, 14, 15]. As mentioned in the introduction, only a few algorithms take the
target system's architecture into account. Since we expect better results from
these algorithms, three of them - MH [2], DLS [16], BSA [17] - were among the
first algorithms implemented for the parallelising tool.

Apart from the DAG algorithms, algorithms based on the ITG are being 
implemented. For this graph model unfolding, re-timing and software pipelining 
are popular techniques [24, 25, 11]. Some of these algorithms again utilise DAG
scheduling algorithms for partially unfolded ITGs.

To benefit from regular structures of graphs, especially from graphs derived
from equations, techniques known from VLSI processor design [20] are employed.
These techniques use the regular structure to project (map) nodes to
processors and to schedule synchronous executions.

7 Output 

The output of the tool is the parallel structure of the program, the schedules and
mappings. From the initial description of the program the user obtains a parallel
structure displayed in the visual environment. By manipulating this structure and
applying mapping and scheduling algorithms, the user receives a guideline for
the implementation of the program with the classical parallel tools. The user
can read off the partition into sub-tasks from the displayed graphs as well as the
dependence between these sub-tasks. The schedules displayed as Gantt charts
allow the programmer to define the execution order and/or starting time of the
sub-tasks. The tool permits the programmer to experiment with mapping and
scheduling algorithms before the actual implementation of the program. This is
in contrast to classical parallelisation tools that display the performance of the
program after the implementation.

A long term objective of the parallelising tool is to utilise code generators for 
the implementation of the program. A code generator can include communication 
and synchronisation primitives in the code annotated to the nodes, according to 
the edges of the graph [1, 4]. This can be done in a platform independent form 
(e.g. C with MPI functions, VHDL) or by utilising communication primitives for
the specific architectures and platforms. 

8 Conclusions 

This paper presents a new parallelising tool based on graph theoretic models. Its 
design is platform independent as it is implemented in Java and as its internal 
structures are not bound to any architecture. 

We proposed a new graph theoretic model, called the annotated hierarchical
graph, that integrates the widespread models DAG, ITG and TCG. This graph
model renders the tool universal, since it is capable of representing coarse grained 
and iterative computations in a single model and in a compact form. Existing 
parallelising tools are limited in their range of supported applications. With the 
annotated hierarchical graph, techniques from different areas of scheduling and 
mapping research are integrated into one tool. 

We described the elements of the parallelising tool and pointed out its modular
structure. The proposed framework for the development of parallelising
algorithms allows new algorithms to be rapidly implemented and adapted for the
parallelising tool. Moreover, the tool permits new algorithms to be developed without the
need for a proprietary test environment. The visual environment supports the
user in parallelising decisions, because the display of the parallel structures of a 
program helps the user to understand, correct and optimise these structures. 

In the future, the advantages of this new tool have to be demonstrated with the
parallelisation of programs that benefit from the graph’s hierarchical structure.


Abstract. This paper analyzes the impact of hardware multithreading
support on the performance of distributed shared-memory (DSM)
multiprocessors built out of heterogeneous, single-chip computing nodes.
Area-efficiency arguments motivate a heterogeneous, hierarchical organization
(HDSM) consisting of a few processors with extensive support for
instruction-level parallelism and large caches, and a larger number of
simpler processors with smaller caches for efficient execution of thread-parallel
code. Such a heterogeneous machine relies on the execution of
multiple threads per processor to deliver high performance for unmodified
applications. This paper quantitatively studies the performance of
HDSMs for software-based and hardware-multithreaded scenarios. The 
simulation-based experiments in this paper consider a 16-node multipro- 
cessor, six homogeneous shared-memory benchmarks from the SPLASH- 
2 suite, and a decision-support application (C4.5). Simulation results 
show that a hardware-based, block-multithreaded HDSM configuration 
outperforms a software-multithreaded counterpart, on average, by 13%. 



1 Introduction 



Continuing technological advances in VLSI manufacturing are predicted to bring 
about billion-transistor chips in the next decade [15]. Such a large transistor budget
allows for the implementation of high-performance uniprocessors [12] that aggressively
exploit instruction-level parallelism (ILP), as well as chip multiprocessors
[8] that can efficiently execute explicitly parallel tasks.

Large multiprocessor configurations of the future will be able to use such 
high-performance components as commodity building blocks in their design. Pre- 
vious work [6] has shown that combining nodes of different processor and memory 
characteristics into a heterogeneous distributed shared-memory (HDSM) mul- 
tiprocessor leads to area-efficient designs. 

An HDSM combines a few high-performance processors and memories with a
larger number of simpler processors and smaller memories to form a hierarchi- 
cal, heterogeneous system [1] capable of fast execution of both sequential and 
parallel codes. Figure 1 depicts the organization of an HDSM. 




[Figure: three-level hierarchy, labelled Level 1, Level 2 and Level 3]

Fig. 1. HDSM: processor-and-memory hierarchical organization. Processors and
memories are drawn such that processor performance and memory capacity are
proportional to their area in the figure.



The proposed heterogeneous DSM machines rely on the execution of mul- 
tiple threads per processor to deliver high performance for unmodified, homo- 
geneous applications. Previous work has studied the performance of HDSMs 
assuming a software multi-tasking model based on voluntary context switches. 
This model is valid for current commodity microprocessors that do not provide 
hardware mechanisms to implement fine-grain multithreading. However, hard- 
ware multithreaded microarchitectures are currently being used in commercial 
processors [16] and considered in the implementation of future-generation high- 
performance microprocessors [4]. 

This paper extends the performance studies of HDSMs reported in [6] by 
quantitatively analyzing the impact of hardware multithreading on their perfor- 
mance. This paper also complements previous work by employing a simulation 
model that explicitly accounts for heterogeneity of processor performance due 
to ILP. 

The quantitative analysis is performed via simulation of parallel benchmarks 
from the SPLASH-2 suite [18] and of a hand-parallelized decision-support application
(C4.5 [13]). Benchmarks are simulated individually to study single-program
parallel speedup. All benchmarks are programmed with single-program,
multiple-data (SPMD) extensions to the C language. 

A modified version of the RSIM [10] multiprocessor simulator is used in the
experiments. The original RSIM simulator models DSM machines built out of 
homogeneous ILP processors, with no hardware support for multithreading. It 
has been modified for the performance analysis shown in this paper to model 
heterogeneity of ILP processors and caches, and to model hardware support for 
multithreading. 

This paper is organized as follows. Section 2 describes the heterogeneous
DSM machine model studied in this paper. Section 3 presents the experimental
methodology used in the performance study. Section 4 presents experimental 
results and data analyses. Section 5 concludes this paper. 

2 Machine Model 

2.1 Heterogeneous DSMs 

HDSM machines differ from conventional distributed shared-memory multipro- 
cessors in that processors, memories and networks of HDSMs may be heteroge- 
neous. In this paper, processor heterogeneity is modeled in terms of degree of 
support for ILP. Heterogeneity in the memory subsystem is modeled in terms of 
L1 and L2 cache sizes and access times. Heterogeneity of the network subsystem
is not modeled in this paper. 

The heterogeneity of processors and caches is motivated by area/parallelism 
tradeoffs in the design of future-generation microprocessors: the system consists 
of a combination of few, aggressive uniprocessors with large caches and many 
simpler processors with smaller individual caches. The former processors devote 
large numbers of transistors to deliver high performance for sequential codes, 
while the latter processors have smaller silicon area requirements and deliver 
high performance for parallel codes. 

The area/parallelism argument that motivates the design of HDSMs is based 
on the use of area-efficient simple processors for execution of parallel codes, and 
aggressive uniprocessors for execution of sequential codes. For highly parallel 
tasks, the high-performance uniprocessors can also be assigned to parallel com- 
putation. 

Previous work has shown that a software-based assignment of multiple threads
to the high-performance ILP uniprocessors of an HDSM yields performance improvements
for memory- and cpu-intensive programs [6]. Context switches in
software multi-tasking occur infrequently, and have large execution time overheads.
Such a coarse-grain model limits the potential for overlapping high-latency
shared-memory DSM operations.

Research on multi-threaded processors has shown that aggressive ILP unipro- 
cessors can be enhanced to support multiple threads with small increases in 
chip area requirements [4]. The implementation of hardware multi-threading 
extensions into the aggressive ILP processors of an HDSM can increase overall
system performance by increasing the potential for overlapping of shared-memory
accesses. To investigate the performance of such an enhanced system, the
high-performance processors of the HDSM machine modeled in this paper have
hardware support for block-multithreading.



2.2 HDSM Multiprocessor Configuration 

The HDSM multiprocessor under study consists of sixteen nodes. Each node 
contains a single processor, L1 and L2 data caches, main memory and a remote
access device (RAD) with network interface and coherence controller. The nodes 
are interconnected by a 2-D mesh. Cache coherence is maintained via a directory 
controller that implements the MESI [11] protocol. The release consistency [7] 
memory model is assumed in this study. Figure 2 depicts the machine model. 



[Figure legend: Level 1, Level 2, Level 3]

Fig. 2. HDSM model: each heterogeneous node has a single processor (P), two
levels of data cache (L1, L2), main memory (MEM) and a remote access device
(RAD), all connected by a memory bus. Nodes are interconnected via a mesh
network.



Heterogeneity is present in both the processor and memory subsystems of 
the HDSM machine. Processor heterogeneity is modeled in terms of the size 
of hardware structures dedicated to ILP exploitation. The heterogeneous ILP 
parameters investigated in this paper are issue rate, instruction window size, 
number of arithmetic (ALU), floating-point (FPU) and address units, and max- 
imum number of outstanding cache misses (MSHRs [9]). Heterogeneity in the 
memory subsystem is modeled in terms of the size and speed of caches. 

The HDSMs under study have three levels, with 2, 4 and 10 nodes in levels 
1, 2 and 3, respectively. The machine is configured as a processor-and-memory 
hierarchy [1]; the number of processing elements increases from top to bottom 
levels of the hierarchy, while cache sizes and the performance of processors and 
cache memories decrease from top to bottom levels. Table 1 shows the processor 
and memory configurations assumed for each machine level. 

                                        level i=1   level i=2   level i=3
number of processors(i)                     2           4          10
issue width(i)                              8           4           1
instruction window size(i)                128          64           8
number of ALU/FPU/address units(i)          4           2           1
number of MSHRs(i)                         12           8           4
L1 cache size(i)                         32KB        16KB         8KB
L2 cache size(i)                          1MB       256KB        64KB
L2 cache miss detection latency(i)         10           5           3
L2 cache hit latency(i)                    25          13           8

Table 1. 3-level, 16-processor heterogeneous machine configuration. L2 cache
miss detection and hit latencies are shown in terms of clock cycles.

The inter-processor network is assumed to be homogeneous. This assumption
is conservative in accounting for inter-processor communication latencies.
Given the predicted integration level of next-generation microprocessors, it is
conceivable that HDSM levels built out of simple processors be integrated into
single-chip multiprocessors [8, 6]. Such a configuration would allow smaller intra-level
latencies than those assumed in the machine model.



2.3 Heterogeneous Node Configurations 

The configuration of the level-3 processor is based on a simple out-of-order micro- 
processor pipeline that issues one instruction per cycle. The level-2 configuration 
is based on current high-performance, out-of-order microprocessor designs [5]. 
The high-performance level-1 processor is based on predicted configurations of
future-generation ILP microprocessors [3, 14]. 

The cache sizes of the level-1 processor are dimensioned so that the L1 and
L2 data caches are large enough to hold the primary and secondary working
sets, respectively, of the SPLASH-2 benchmarks [18]. Cache sizes of lower-level
processors are scaled down (with respect to the adjacent upper level) by factors
of 2 (L1 cache) and 4 (L2 cache).
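
The scaling rule can be checked against the cache sizes listed in Table 1 with a few lines of Java. The class and method names are hypothetical; only the level-1 sizes (32KB and 1MB) and the scaling factors (2 and 4) are taken from the text.

```java
import java.util.Arrays;

/** Hypothetical sketch: derives per-level cache sizes from the scaling factors. */
class CacheScalingSketch {

    /** Returns the cache size in KB for levels 1..levels, dividing by 'factor' per level. */
    static int[] scaledSizesKB(int level1SizeKB, int factor, int levels) {
        int[] sizes = new int[levels];
        sizes[0] = level1SizeKB;
        for (int i = 1; i < levels; i++) {
            sizes[i] = sizes[i - 1] / factor;
        }
        return sizes;
    }

    public static void main(String[] args) {
        // L1: 32KB at level 1, scaled down by 2 per level.
        System.out.println("L1 (KB): " + Arrays.toString(scaledSizesKB(32, 2, 3)));
        // L2: 1MB = 1024KB at level 1, scaled down by 4 per level.
        System.out.println("L2 (KB): " + Arrays.toString(scaledSizesKB(1024, 4, 3)));
    }
}
```

The resulting sizes, 32/16/8 KB for L1 and 1024/256/64 KB for L2, match the entries of Table 1.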

The L1 cache access times are assumed to be a single processor cycle for
all processor configurations: it is assumed that clock cycles are the same for all
processors and that the level-1 caches are designed to match the pipeline clock.
The L2 cache tag and data access times are modeled after the analytical cache
access time model described in [17], assuming a 0.18 µm technology [14].

The remaining processor and memory simulation parameters are homoge- 
neous across HDSM nodes and are set to the default values of the original RSIM 
simulator. 



2.4 Programming Model 

This paper considers the execution of homogeneous parallel applications on 
HDSMs. These programs are written in the single-program, multiple data (SPMD) 
model. The homogeneous programs are mapped onto heterogeneous resources
without source code modifications via static thread-to-processor assignment schemes.
The next subsection details the two assignment schemes studied in this
paper. 

2.5 Multi-threading Model 

In this paper, two policies are considered in the assignment of threads to hetero- 
geneous processors. In the virtual-processor policy, both software and hardware 
support for multithreading are studied. 

1. Single-thread: one thread is assigned to each processor in the system. 

2. Virtual-processor: in this scheme, a processor Pi is assigned VP(i) threads,
where VP(i) is the ratio between the performance of Pi and that of the slowest
processor in the system. This ratio is obtained from the uniprocessor simulation
results summarized in Figure 3 (for benchmarks that require a power-of-two number
of processors, 5, 3, and 1 threads are assigned to processors in levels 1, 2, and 3,
respectively). There are two different multithreading scenarios studied under
this assignment policy:

(a) Software multithreading: in this scenario, thread context switches are 
triggered only by failed synchronizations on locks and barriers. To imple- 
ment this switching criterion, the RSIM synchronization library has been 
modified to include a voluntary context-switch call in the spin-waiting 
loop of the synchronization operations. The software context-switching 
overhead is modeled in the simulator by forcing the switching proces- 
sors to be idle for a configurable number of clock cycles. The context 
switching overhead in this scenario is 800 processor cycles. 

(b) Hardware multithreading: in this scenario, hardware support for
block-multithreading [2] is available in the HDSM level-1 and level-2 processors.
Thread context switches are triggered by the following criteria (in
addition to failed synchronization): when L2 cache misses occur, when
the number of cycles without any instruction graduation exceeds the
threshold Tgrad, and when the total number of cycles without any thread
context switch exceeds the threshold Tswitch. In this paper, Tgrad and
Tswitch are set to 20 and 10000 processor cycles, respectively. The context
switching overhead in this scenario is set to 4 processor cycles. In addition,
threads are guaranteed not to be context-switched for a minimum
run length of 4 cycles.
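
The switching criteria of the two scenarios can be summarised as predicates. The Java sketch below is only an illustration of the rules as stated in the text and not the modified RSIM code: the class, method and parameter names are hypothetical, and it assumes that the minimum run length is measured from the last context switch.

```java
/** Hypothetical sketch of the context-switch criteria of both scenarios. */
class ContextSwitchSketch {

    static final int T_GRAD = 20;              // cycles without instruction graduation
    static final int T_SWITCH = 10000;         // cycles without any context switch
    static final int MIN_RUN_LENGTH = 4;       // guaranteed run length in cycles
    static final int HW_SWITCH_OVERHEAD = 4;   // cycles per hardware context switch
    static final int SW_SWITCH_OVERHEAD = 800; // cycles per software context switch

    /** Software multithreading: only failed synchronizations trigger a switch. */
    static boolean softwareSwitch(boolean failedSynchronization) {
        return failedSynchronization;
    }

    /** Hardware block-multithreading (level-1 and level-2 processors). */
    static boolean hardwareSwitch(boolean failedSynchronization,
                                  boolean l2CacheMiss,
                                  int cyclesWithoutGraduation,
                                  int cyclesSinceLastSwitch) {
        // Assumption: the minimum run length counts cycles since the last switch.
        if (cyclesSinceLastSwitch < MIN_RUN_LENGTH) {
            return false;
        }
        return failedSynchronization
                || l2CacheMiss
                || cyclesWithoutGraduation > T_GRAD
                || cyclesSinceLastSwitch > T_SWITCH;
    }

    public static void main(String[] args) {
        System.out.println(hardwareSwitch(false, true, 0, 2));      // miss inside minimum run length
        System.out.println(hardwareSwitch(false, true, 0, 10));     // L2 miss
        System.out.println(hardwareSwitch(false, false, 21, 100));  // graduation threshold exceeded
        System.out.println(softwareSwitch(false));                  // no failed synchronization
    }
}
```

The difference in overhead (800 cycles against 4 cycles) is what allows the hardware scenario to switch on events as frequent as L2 cache misses.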

3 Experimental Methodology 

3.1 Benchmarks 

The set of benchmarks used in this paper includes six programs from the SPLASH- 
2 [18] suite and a parallelized version of the decision-support database program 
C4.5 [13]. 

The programs (and respective data sets) studied in this paper are C4.5 (adult
dataset with unknowns removed and a minimum node size of 100), FFT (16K
points), FMM (4096 particles), LU (256x256 matrix), Ocean (258x258 ocean),
Radix (512K integers) and Water (512 molecules). All benchmarks are compiled
with Sun Microsystems' Workshop C compiler version 4.2 and optimization level
-xO4.



3.2 Simulation Environment 

The simulation environment is based on a modified version of the RSIM simula- 
tor [10] that models a release-consistent DSM machine connected by a 2-D mesh, 
with uniprocessor heterogeneous nodes with support for block-multithreading. 



4 Experimental Results 

In this section, the performance of HDSMs is analyzed for the thread assignment
schemes described in Section 2.5. Initially, the relative performance of the
individual heterogeneous processors is discussed. Subsequently, the impact of 
multithreading on HDSM performance is analyzed. 



4.1 Impact of ILP Heterogeneity on Single-Node Performance 

Figure 3 shows the performances of the heterogeneous processors and caches in 
terms of speedups with respect to a base (level-3) processor. The level-2 and 
level-1 processors outperform the single-issue level-3 processor, on average, by 
277% and 396%, respectively. Since clock speeds are assumed to be the same for 
all processors, the performance differences between the heterogeneous processors 
are due to instruction-level parallelism and cache sizes only. 

Figure 3 shows that an eight-fold increase in issue rate and a sixteen-fold
increase in L2 cache size yield an average four-fold performance improvement of the
level-1 processor over the simple level-3 processor. This result is consistent with
the area-efficiency analysis based on a case study of Alpha microprocessors presented
in [6]. The increase in chip area necessary to implement larger caches and
structures devoted to the extraction of ILP yields sub-linear gains in performance
under the assumption of the same fabrication technology (and clock cycle).

4.2 Parallel Speedup Analysis 

Figure 4 shows the speedups of the 16-node HDSM with respect to the base 
(level-3) processor for the three different assignment scenarios described in Sec- 
tion 3. In the virtual-processor assignment, 4, 2 and 1 threads are assigned to 
level-1, level-2 and level-3 processors, respectively (except for benchmarks that 
require power-of-two processors, where 5, 3 and 1 threads are assigned to proces- 
sors of levels 1, 2 and 3). The simulation results show that the virtual-processor 
Fig. 3. Simulated uniprocessor speedups (with respect to level-3 processor) of 
the heterogeneous configurations (Level 3, Level 2, Level 1), shown in Table 1. 



assignment significantly outperforms the single-thread assignment under both 
studied multithreading models. The average virtual-processor speedups are 28% 
and 45% for the software and hardware multithreaded schemes, respectively. 
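
The two assignment policies compared here can be sketched as follows. This is 
only an illustration, not the simulator's code: the function names and the 
example node mix are hypothetical (the actual mix of level-1, level-2 and 
level-3 nodes is the one given in Table 1), and only the 4/2/1 thread counts 
come from the text. 

```python
# Threads per node for each processor level, as stated in the text
# (4, 2 and 1 threads for level-1, level-2 and level-3 processors).
THREADS_PER_LEVEL = {1: 4, 2: 2, 3: 1}


def single_thread_assignment(node_levels):
    """One thread per node, ignoring processor heterogeneity."""
    return list(range(len(node_levels)))


def virtual_processor_assignment(node_levels, threads_per_level=THREADS_PER_LEVEL):
    """Return a list whose i-th element is the node that runs thread i.

    Faster (lower-numbered level) nodes receive more threads, so each
    thread sees a 'virtual processor' of roughly similar speed.
    """
    assignment = []
    for node, level in enumerate(node_levels):
        assignment.extend([node] * threads_per_level[level])
    return assignment


# Hypothetical 7-node machine: one level-1, two level-2, four level-3 nodes.
example_levels = [1, 2, 2, 3, 3, 3, 3]
print(single_thread_assignment(example_levels))
print(virtual_processor_assignment(example_levels))
```

With the virtual-processor policy the level-1 node in this made-up machine 
runs four of the twelve threads, which is the load-balancing effect discussed 
in the text. 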

The hardware multithreading model outperforms the software model for all 
benchmarks except Radix; the largest performance improvement is observed in 
FFT (21.6%), followed by C4.5 (19.6%), FMM (16.5%), Ocean (15.4%), LU 
(9.4%) and Water (7.0%). For Radix, the hardware multithreading model 
performs as well as the software model. These results can be explained with a 
closer analysis of the execution time in the level-1 processor. 

Figures 5, 6 and 7 show a breakdown of the execution time in one of the 
level-1 processors into three components: busy, stalled on memory accesses and 
stalled on synchronization (locks and barriers) for the three assignment scenarios 
of Figure 4. 

In the single-thread case (Figure 5), the high-performance level-1 processor 
spends most of its execution in synchronization points. Since this assignment 
does not account for heterogeneity in processor performance, the level-1 
processor is often waiting to synchronize with lower-level (slower) processors to 
proceed with computation. 

In the software multithread case (Figure 6), the level-1 processor spends less 
time in synchronization relative to actual computation. The load-balancing 
property of the virtual-processor scheme allows the level-1 processor to perform more 
Fig. 4. Simulated HDSM speedups (with respect to level-3 processor) for 
single-thread (ST) and virtual-processor assignments (software, MT-SW, and 
hardware, MT-HW, multithreading models). 



computation before attempting to synchronize with lower-level processors, and 
hence the synchronization component is reduced significantly. Since the proces- 
sor spends less time in synchronization points, the (relative) busy and memory 
components increase. 

A comparison of the multithread cases (software in Figure 6, hardware in 
Figure 7) shows that, for all benchmarks (in particular, C4.5 and FFT), the 
relative memory access component is reduced when hardware support is present. 
This is explained by the ability of hardware multithreading to hide memory 
latencies by overlapping memory accesses from distinct threads. The improved 
memory behavior is reflected in increased processor usage (busy component) and, 
ultimately, in better performance over the software scheme as shown in Figure 4. 

For Radix, the hardware scheme fails to deliver better performance for the 
following reason. In Radix, the increased frequency of context switches causes 
interference in the level-1 cache, increasing the worst-case L1 miss rate in 
processor 0 (HDSM level 1) from 9.7% to 15.1%. 



5 Conclusions 

A heterogeneous, hierarchical organization of processor and memory resources 
of a DSM allows efficient execution of codes with various degrees of parallelism. 
Fig. 5. Relative contributions of busy, memory and synchronization to total 
execution time of a level-1 processor under the single-thread assignment 
(benchmarks: C4.5, FFT, FMM, LU, Ocean, Radix, Water). 



This organization also delivers high performance for unmodified, homogeneous 
shared-memory parallel programs that exhibit a single degree of parallelism. 

Support for the execution of multiple threads in the high-performance 
processors of a heterogeneous DSM is key to delivering high performance for such 
homogeneous parallel applications. This paper shows that the virtual-processor 
assignment of threads to nodes that are heterogeneous only with respect to ILP 
hardware and cache sizes improves the average performance of HDSMs by up to 
45%, when compared to a single-thread assignment policy. 

This paper also shows that hardware support for block multithreading in the 
high-performance upper-level processors is desirable for an HDSM organization. 
A simulation analysis shows that hardware multithreading improves the 
performance of virtually-assigned homogeneous applications in HDSMs by as 
much as 21% (13% on average) over a software-based context-switching scheme. 

A detailed analysis of the execution in the multithreaded upper-level proces- 
sors shows that, while the virtual-processor thread assignment mechanism is able 
to improve load balancing, the hardware multithreading solution is particularly 
effective in overlapping high-latency shared-memory accesses and reducing the 
memory component of the execution time. 
Fig. 6. Relative contributions of busy, memory and synchronization to total 
execution time of a level-1 processor under the virtual-processor, software 
multithreading (MT-SW) assignment. 
Abstract. The growing market of embedded systems and applications has led to 
the making of more general embedded processors, with some features 
traditionally associated with general-purpose microprocessors. Following this 
trend, recent research has tried to incorporate into embedded processors the 
newest techniques to break down ILP limits. Value speculation is a recent 
technique not yet considered in the context of embedded processors, and the 
goal of the present work is to analyse the performance potential of this 
technique within this scope. 



1 Introduction 

Over the last few years, the increasing number of communication and multimedia 
applications has brought about a growing demand for high performance in embedded 
computing systems [1], [2], and many of the techniques for extracting Instruction- 
Level Parallelism (ILP), traditionally used in high performance general-purpose 
systems, are being applied to embedded processors [3]. The limits on the amount of 
extractable ILP are due to the program dependencies, and data dependencies present a 
particularly major hurdle. Through value speculation, it is possible to counteract data 
dependencies and thus increase the program’s degree of parallelism. 

The value prediction technique, like branch prediction, allows temporary violation 
of the program constraints without affecting its semantics. Based on the previous 
history of program execution, the hardware predicts at run-time the outcome of an 
instruction, which is used by the consumer instructions when the real data is not yet 
ready. When the true data becomes available, it is compared with the predicted value, 
and in the case of a mismatch, the instructions are re-executed with the correct value. 

In the context of general-purpose microprocessors, the performance potential of 
this relatively recent technique has been shown to be significant in a number of 
studies [4], [5]. Our intuition is that multimedia and communication programs 
exhibit more predictable value behavior than normal programs, due to the nature of 
both the algorithms and the input data. The objective of this work is to apply 
value prediction techniques to embedded processors and to demonstrate that they 
are more effective within this scope. 

To carry out this comparative analysis we have collected results for the integer 
SPEC95 and MediaBench [6] benchmarks. We used integer SPEC95 as an evaluation 
benchmark in the context of general-purpose systems and MediaBench (composed of 
applications culled from image processing, communications and DSP) as 
a representative benchmark set for embedded computing systems. First, we perform a 
predictability analysis, and we show that the output values of the MediaBench 
programs are, on average, more predictable than those of the SPEC95 programs, 
using several low-cost configurations of different predictor models. However, 
predictability results alone are not enough to justify the use of extra hardware 
to predict values; it is essential to show that processor performance is also 
improved. We therefore also perform detailed timing simulations in order to 
compare the speedup achievable by using value prediction in two typical 
architectures (a high-performance embedded processor architecture and a 
high-performance general-purpose processor architecture), and we show that, 
using a low-cost value predictor, an embedded processor running the MediaBench 
programs can profit much more from value prediction than a general-purpose 
processor running the SPEC programs. 

The paper is organised as follows. Section 2 summarises the previous work on data 
value prediction. Section 3 describes the experimental framework. Section 4 presents 
a comparative analysis of value predictability for different predictor models. Section 5 
describes the two machine models used in the timing simulations and the speedup 
results. Finally, Section 6 presents the conclusions and future work. 



2 Related Work 

Early work on value prediction [7] showed that instructions exhibit a new kind of 
locality, called value locality, which means that the values generated by a given static 
instruction tend to be repeated for a large fraction of the execution time. This property 
allows the data to be predictable. In a later work, Sazeides et al. [4] state that the 
predictability of a value sequence is a function of the sequence itself and the predictor 
used. Consequently, some kinds of sequences, such as stride sequences, are 
predictable even though they do not exhibit value locality. 
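
The distinction between value locality and predictability can be illustrated 
with a small sketch. The helper names and the example sequences below are 
hypothetical and are not taken from [4] or [7]; the code only counts how often 
"same as last time" and "last value plus last stride" would have been correct. 

```python
def last_value_hits(seq):
    """Count positions where the value repeats the previous one (value locality)."""
    return sum(1 for prev, cur in zip(seq, seq[1:]) if cur == prev)


def stride_hits(seq):
    """Count positions correctly predicted as last + (last - second_to_last)."""
    return sum(1 for a, b, c in zip(seq, seq[1:], seq[2:]) if c == b + (b - a))


constant_seq = [7] * 10              # e.g. a loop-invariant value
stride_seq = list(range(0, 40, 4))   # e.g. an induction variable: 0, 4, 8, ...

print(last_value_hits(constant_seq), stride_hits(constant_seq))
print(last_value_hits(stride_seq), stride_hits(stride_seq))
```

The stride sequence never repeats its previous value, so it shows no value 
locality, yet every element after the second is predictable from the stride. 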

Most of the value predictors proposed in the literature fit into one of the following 
types. Last-value predictors (LVP) make a prediction based on the last 
outcome of the same static instruction, and can correctly predict constant sequences of 
data [7], [8]. Stride predictors (SP) make a prediction based on the last 
outcome plus a constant stride, and can correctly predict arithmetic sequences of data 
(including constant sequences, whose stride is 0) [8], [9]. Context-based predictors 
(CBP) learn the values that follow a particular context and make a prediction 
based on the last values generated by the same instruction; they can correctly predict 
repetitive sequences of data [4], [8]. Hybrid predictors (HP) combine some of 
the previous predictors and include a selection mechanism, which is implemented 
either in hardware [8], [10], [11] or in software [12]. To date, most of the 
implementations of these predictors have been simulated in the context of 
general-purpose superscalar processors using SPEC95 as the evaluation benchmark 
suite. The results obtained are very promising: on average, about 50% of the output 
values of a program can be correctly predicted, yielding a speedup of about 20% 
[10], [11], [4]. However, these results require sophisticated and expensive 
predictors, which are difficult to implement with current technology. 
In the context of embedded processors, several studies try to 
improve performance by applying techniques traditionally used in 
general-purpose processors. However, value prediction is a recent technique not yet 
applied to this kind of processor. The reason lies in area restrictions, a major 
challenge in embedded systems, which make it infeasible to include very 
expensive value prediction hardware in the processor. Nevertheless, this is not the 
case here, since only a very small predictor table is needed for this particular kind of 
application, as we will show later. 



3 Experimental Framework 

This section describes the framework employed in our research to obtain the 
experimental results. We performed our experiments on simulators derived from the 
SimpleScalar 3.0 toolset (PISA version) [13], a suite of functional and timing 
simulation tools. 

As we mentioned above, we collected results from the integer SPEC95 and 
MediaBench (MB) [6] benchmarks, whose characteristics are shown in Tables 1 and 2 
respectively. 



Table 1. SPEC95 integer benchmark statistics 

BENCH.       DESCRIPTION        INPUT SET                   # INST.   %LOAD   %INT 
Compress95   Data compression   30000 e 2231                95 M      21.35   46.03 
Cc1          Compiler           Ref. Input (gcc.i)          203 M     26.05   39.95 
Go           Game               99                          132 M     20.66   57.16 
Ijpeg        Jpeg encoder       Train Input (specmun.ppm)   553 M     17.63   65.21 
M88ksim      M88000 Simulator   Train Input                 120 M     18.98   49.82 
Perl         PERL interpreter   Train Input (scrabbl.in)    40 M      27.83   34.97 
Li           LISP emulator      Train Input                 183 M     25.90   34.74 
Vortex       Data base          Train Input                 2520 M    30.67   30.82 



Table 2. MediaBench suite characteristics 

BENCH.        DESCRIPTION                  INPUT SET        # INST.   %LOAD   %INT 
Jpeg          JPEG image comp / decomp     Testimg.ppm      20 M      22.73   55.75 
Mpeg          MPEG-2 video encod / decod   Rec*.YUV         BOOM      25.41   51.69 
Gsm           GSM speech encod / decod     Clinton.pcm      306 M     14.88   72.47 
G.721         Voice comp / decomp          Clinton.pcm      546 M     13.50   59.13 
Pegwit        Public key encr / decr       Pgptest.plain    50 M      20.98   61.28 
Pgp           Public key encr / decr       Pgptest.plain    153 M     17.31   67.57 
Ghostscript   PostScript interpreter       Tiger.ps         BOOM      14.31   56.21 
Mesa          3-D graphics library         N/A              8 M       23.22   46.10 
Rasta         Speech recognition           Ex5_c1.wav       39 M      21.60   45.14 
Epic          Image comp / decomp          Test_image.pgm   59 M      12.87   53.87 
Adpcm         Audio encod / decod          Clinton.pcm      12 M      6.79    62.99 



A JPEG program appears in both benchmark suites, but despite the similar name 
the two are quite different: not only are the library versions different, but so 
too are the ways they are used. The input files and the program parameters of the 
test programs are different as well. The SPEC95 version of JPEG was modified 
because the cjpeg and djpeg routines, for compression and decompression, 
required too much I/O traffic to conform to SPEC CPU guidelines; this was 
overcome by reading the image into a memory buffer, and processing it repeatedly 
with different compression settings. 

The majority of the MediaBench programs are composed of two applications: 
compression/decompression or coding/decoding. We have combined the results for 
the two applications by first executing the compression or coding program and then 
the decompression or decoding program, and merging the data obtained. The programs 
were compiled with the gcc compiler included in the tool set, using optimization 
level -O3. Due to time constraints, we have simulated only 100 million instructions. 



4 Predictability Analysis 

In this section we analyze and compare value predictability for the MediaBench and 
SPEC95 benchmark suites. This analysis is based on the percentage of program 
values that can be correctly predicted. Our main purpose is to demonstrate that typical 
embedded applications exhibit a more predictable value behavior than normal 
applications, especially for low-cost predictors. 

As mentioned before, the predictability of a value sequence is a function of both 
the sequence itself and the predictor employed. Therefore, in order to accurately 
compare several program sets, it is necessary to carry out experiments for all the 
existing predictor models. Furthermore, we must consider that using idealized 
predictors (infinite tables) it is possible to evaluate the theoretical value predictability 
of programs [4], although this is not our goal. On the contrary, we want to empirically 
assess the program predictability by using realistic and low-cost implementations of 
the predictor models (limited table size). From this pragmatic analysis we should be 
in a position to foresee some of the performance results presented later, and we should 
also be able to select the most suitable value predictor for embedded processors. 



4.1 Predictor Models 

We should first introduce the particular low-cost implementations of the predictor 
models, which are employed in this work. In view of the fact that the last value 
predictions are special stride predictions (with zero stride), only stride, context-based 
and hybrid prediction schemes are considered. An initial analysis of each benchmark 
value behavior is also presented below. 

Stride Predictor Implementation. The SP is implemented by means of a direct 
mapped table. The table is indexed using the least significant bits of the instruction 
PC. Each table entry stores the following information: the last-value produced by the 
instruction (32 bits), the stride between the two last outputs of the instruction (8 bits), 
and the confidence bits. The percentages of values correctly predicted (also called 
predictor efficacy) for both program suites, MediaBench (MB) and SPEC95, are 
shown in figure 1. 
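
The table organization just described can be sketched in software as follows. 
This is a minimal illustration under stated assumptions, not the simulated 
hardware: the class and method names are hypothetical, the table has no tags, 
`align_shift` (how many low PC bits are dropped before indexing) is an assumed 
parameter, and the confidence update rule (add one on a correct prediction, 
subtract one otherwise) is an assumption. Only the field widths (32-bit last 
value, 8-bit stride) and the PC-indexed direct-mapped table come from the text. 

```python
MASK32 = 0xFFFFFFFF


def to_signed32(x):
    """Interpret the low 32 bits of x as a signed integer."""
    x &= MASK32
    return x - (1 << 32) if x >= (1 << 31) else x


class StridePredictor:
    """Direct-mapped, untagged stride predictor (illustrative sketch)."""

    def __init__(self, entries=256, conf_max=3, threshold=3, align_shift=2):
        if entries <= 0 or entries & (entries - 1):
            raise ValueError("entries must be a power of two")
        self.mask = entries - 1
        self.conf_max = conf_max        # assumed saturating counter range
        self.threshold = threshold      # assumed confidence threshold
        self.align_shift = align_shift  # assumed instruction alignment
        self.last = [0] * entries       # 32-bit last value
        self.stride = [0] * entries     # 8-bit signed stride
        self.conf = [0] * entries       # confidence counter

    def index(self, pc):
        """Index the table with the least significant bits of the PC."""
        return (pc >> self.align_shift) & self.mask

    def predict(self, pc):
        """Return (predicted value, whether the entry is confident)."""
        i = self.index(pc)
        value = (self.last[i] + self.stride[i]) & MASK32
        return value, self.conf[i] >= self.threshold

    def update(self, pc, actual):
        """Train the entry with the real outcome; return True if it was predicted."""
        i = self.index(pc)
        actual &= MASK32
        correct = ((self.last[i] + self.stride[i]) & MASK32) == actual
        step = 1 if correct else -1
        self.conf[i] = max(0, min(self.conf_max, self.conf[i] + step))
        delta = to_signed32(actual - self.last[i])
        # A stride that does not fit in 8 bits is stored as 0 (assumption).
        self.stride[i] = delta if -128 <= delta <= 127 else 0
        self.last[i] = actual
        return correct


sp = StridePredictor(entries=256)
hits = [sp.update(0x1000, v) for v in range(100, 120, 4)]
print(hits, sp.predict(0x1000))
```

Because the table is untagged, two instructions whose PCs share the same low 
bits use the same entry, which is why table size matters for programs with 
many static instructions. 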

Looking at the results in figure 1, the first remark that should be made is that, 
apart from gsm and pegwit, a considerably high percentage of the MB program values 
can be correctly predicted by the SP (40%-50%), and very small tables suffice to 
achieve these results. Furthermore, except for three of the eleven programs that make 
up the MB suite, almost the same percentage of correct values is obtained with a 
256-entry table as with a 4K-entry table. On the other hand, the results for the 
SPEC95 benchmarks show an appreciably different behavior. 
For most of the programs the predictor table size has a significant influence on 
efficacy and the results are not particularly outstanding. Nevertheless, the m88ksim 
program exhibits a particularly high value predictability, thus appreciably raising 
the average results. 



Fig 1. SP efficacy for 256, 512, 1K, 2K and 4K-entry tables: a) MediaBench, 
b) SPEC95 



Context-Based Predictor Implementation. The CBP is derived from the work of 
Sazeides et al. [14] and it uses a 2-level table. The first-level table, called the Value 
History Table (VHT), is direct mapped and is indexed using the least significant bits 
of the instruction PC. This table stores an order-3 context composed of the last value 
produced by the instruction (32 bits) and the two strides between the last 3 outcomes 
produced by the instruction (8 bits each). The second-level table, called the Value 
Prediction Table (VPT), is indexed by a hash function, which uses context information 
from the VHT. The VPT is responsible for storing the value prediction (32 bits) and 
the confidence estimation for each context. The shift-xor-fold hash function (also used 
for indexing the 2nd-level table in the hybrid predictor), shown in figure 2, differs from 
the original one proposed by Sazeides, and significantly reduces the aliasing in the 
VPT (especially for small tables). 
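
A two-level organization of this kind can be sketched as follows. The sketch 
is illustrative only: the names are hypothetical, and the hash shown is a 
generic shift/xor/fold combination chosen for the example, since the exact 
function of figure 2 cannot be recovered from the text. The confidence update 
rule and `align_shift` are likewise assumptions. 

```python
MASK32 = 0xFFFFFFFF


def clamp_stride(delta):
    """Keep a signed 8-bit stride; larger strides are stored as 0 (assumption)."""
    delta &= MASK32
    if delta >= 1 << 31:
        delta -= 1 << 32
    return delta if -128 <= delta <= 127 else 0


def xor_fold(key, bits):
    """Fold an integer key down to `bits` bits by xoring successive chunks."""
    mask = (1 << bits) - 1
    folded = 0
    while key:
        folded ^= key & mask
        key >>= bits
    return folded


class ContextPredictor:
    """Two-level context-based predictor: VHT (per-PC context) + VPT (values)."""

    def __init__(self, vht_entries=128, vpt_entries=256,
                 conf_max=3, threshold=3, align_shift=2):
        for n in (vht_entries, vpt_entries):
            if n < 2 or n & (n - 1):
                raise ValueError("table sizes must be powers of two >= 2")
        self.vht_mask = vht_entries - 1
        self.vpt_bits = vpt_entries.bit_length() - 1
        self.conf_max, self.threshold = conf_max, threshold
        self.align_shift = align_shift
        self.vht = [(0, 0, 0)] * vht_entries   # (last value, stride 0, stride 1)
        self.vpt_value = [0] * vpt_entries
        self.vpt_conf = [0] * vpt_entries

    def _vht_index(self, pc):
        return (pc >> self.align_shift) & self.vht_mask

    def _vpt_index(self, pc):
        last, s0, s1 = self.vht[self._vht_index(pc)]
        # Hypothetical hash: shift the strides, xor with the value, then fold.
        key = last ^ ((s0 & 0xFF) << 8) ^ ((s1 & 0xFF) << 16)
        return xor_fold(key, self.vpt_bits)

    def predict(self, pc):
        j = self._vpt_index(pc)
        return self.vpt_value[j], self.vpt_conf[j] >= self.threshold

    def update(self, pc, actual):
        actual &= MASK32
        i, j = self._vht_index(pc), self._vpt_index(pc)
        correct = self.vpt_value[j] == actual
        step = 1 if correct else -1
        self.vpt_conf[j] = max(0, min(self.conf_max, self.vpt_conf[j] + step))
        self.vpt_value[j] = actual
        last, s0, _ = self.vht[i]
        self.vht[i] = (actual, clamp_stride(actual - last), s0)
        return correct


cbp = ContextPredictor()
results = [cbp.update(0x2000, v) for v in [1, 5, 2, 9] * 10]
print(sum(results), cbp.predict(0x2000))
```

The repeating sequence 1, 5, 2, 9 has no constant stride, yet once each 
context has been seen the VPT predicts every following value. 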



Fig 2. CBP hash function (input: the order-3 context, i.e. Stride 1, Stride 0 
and Last) 
Figure 3 presents the efficacy results for several different CBP configurations 
(described in table 3). In contrast to the SP, table size now has a significant 
influence on predictor efficacy for both sets of benchmarks: increasing 
the size of the prediction table from 256 up to 4K entries doubles the CBP efficacy for 
most of the programs. It is also important to highlight that, although for the 
SPEC95 suite the results of the CBP seem slightly worse than for the SP, for the MB 
set we observe a significant improvement in value predictability, especially for 
gsm, which now exhibits good predictability. 
Fig 3. CBP efficacy for 256, 512, 1K, 2K and 4K-entry VPTs 

Hybrid Predictor Implementation. The traditional approach to implementing hybrid 
predictors is based on dissociated predictors and a selection mechanism. Each 
individual predictor produces its own prediction, and the selection logic is responsible 
for choosing the most suitable one for the current instruction. However, when the 
sets of instructions predictable by each predictor overlap heavily, the hardware 
efficiency of this approach is low because it uses duplicated hardware for predicting 
the same instructions. The hybrid predictor employed in this paper is based on 
previous work presented in [11]. Instead of using dissociated predictor schemes, it 
uses overlapped ones and a finite state machine based on value sequence 
classification, which decides when it is necessary to use each part of the predictor. 
The key idea behind this approach is to use the extra hardware only when it is necessary 
for predicting a particular value sequence. Thus, for constant sequences it uses only 
the last-value table, for stride sequences (not constant) it uses both the last-value 
and stride tables, and for non-stride sequences it additionally uses the second-level 
table. Notice that this hybrid predictor produces only one prediction at a time. The 
block and state diagrams of this predictor are shown in figure 4; for more details 
please see [11]. 




Table 3. Predictor configuration 

                                Configuration 
Predictor   Parameter           A      B      C      D      E 
Stride      E                   256    512    1024   2048   4096 
Context     Evht                128    256    512    1024   1024 
            Evpt                256    512    1024   2048   4096 
Hybrid      Elast = Estride     128    256    512    1024   1024 
            Evpt                256    512    1024   2048   4096 
Fig 4. Hybrid predictor: a) block structure, b) state diagram 



Several different configurations are possible for this kind of predictor, since each 
of the tables can be of a different size. In this work we have selected HP configurations 
with the same cost as the CBP. The configurations employed are described in table 3 
and the HP efficacy results are shown in figure 5. 



Fig 5. Hybrid predictor efficacy for 256, 512, 1K, 2K and 4K-entry VPTs: 
a) MediaBench, b) SPEC95 

From the results in figure 5 we can see that, in general, program 
predictability is higher for the hybrid predictor than for the other predictors. 
Nevertheless, variations can be observed depending on the suite under 
consideration. For the MB suite, a remarkable increase in predictability can be 
noticed for all the programs, while for the SPEC95 set this is true only for a few 
benchmarks. In all other aspects, the behavior of the HP is similar to that of the 
previous predictors. The predictability of pegwit, although significantly better 
than for the SP or CBP, is once more particularly poor compared to the other 
programs of the MB set (though not compared to the SPEC95 programs). The 
reason lies in the nature of the program itself. Pegwit is a program for public key 
encryption and its structure has been chosen specifically to avoid redundancy and so 
be resistant to cryptanalysis methods [15]. 



4.2 Comparative Results 

For the sake of highlighting the differences between both benchmark suites, we 
compare the average efficacy results as a function of the predictor cost. Furthermore, 
this comparison also helps us to select the best predictor (i.e. best balance between 
efficacy and cost). 

The most complex structures of the predictor are the prediction tables, and 
therefore we propose using the global table size as a measure of the predictor cost. 
Table 4 describes the formulae used to calculate the overall size of the predictors (E 
represents the number of table entries and N represents the number of entry field bits). 

Table 4. Cost formulae 

Predictor   Global Table Size 
Stride      E * (Nvalue + Nstride) 
Context     Evht * (Nvalue + 2 * Nstride) + Evpt * Nvalue 
Hybrid      Elast * Nvalue + Evpt * Nvalue + 2 * Estride * Nstride 
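
As a worked example, the formulae of Table 4 can be evaluated for the 
configurations of Table 3. The sketch below assumes the field widths given in 
Section 4.1 (32-bit values, 8-bit strides) and, like the formulae, ignores the 
confidence bits; the function and variable names are ours. 

```python
N_VALUE = 32   # bits per stored value (Section 4.1)
N_STRIDE = 8   # bits per stored stride (Section 4.1)


def stride_cost_bits(e):
    return e * (N_VALUE + N_STRIDE)


def context_cost_bits(e_vht, e_vpt):
    return e_vht * (N_VALUE + 2 * N_STRIDE) + e_vpt * N_VALUE


def hybrid_cost_bits(e_last, e_stride, e_vpt):
    return e_last * N_VALUE + e_vpt * N_VALUE + 2 * e_stride * N_STRIDE


def kbytes(bits):
    return bits / 8 / 1024


# Table sizes from Table 3: (E, Evht = Elast = Estride, Evpt) per configuration.
CONFIGS = {
    "A": (256, 128, 256),
    "B": (512, 256, 512),
    "C": (1024, 512, 1024),
    "D": (2048, 1024, 2048),
    "E": (4096, 1024, 4096),
}

for name, (e, e_first, e_vpt) in CONFIGS.items():
    sp = kbytes(stride_cost_bits(e))
    cbp = kbytes(context_cost_bits(e_first, e_vpt))
    hp = kbytes(hybrid_cost_bits(e_first, e_first, e_vpt))
    print(f"{name}: SP {sp:.2f} KB, CBP {cbp:.2f} KB, HP {hp:.2f} KB")
```

Under these assumptions the CBP and HP costs coincide for every configuration, 
which matches the statement that the HP configurations were chosen to have the 
same cost as the CBP. 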



Figure 6 shows the average results for both sets of programs as a function of the 
cost. We have computed two different means in order to evaluate the uniformity of 
the program suite behavior: the normal average and the so-called realistic mean, 
calculated as the arithmetic mean of all programs except those with the best and worst 
behaviors. In general, we observe that in the case of the MediaBench suite, both 
means are practically equal, but for the SPEC95 benchmarks the average is about 5% 
above the realistic mean. This indicates a more homogeneous behavior, in terms of 
value predictability, in the MediaBench set than in the SPEC95 set (which is more 
sensitive to the outstanding behavior of the m88ksim program). 
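
The two means can be computed as in the following sketch. The efficacy values 
are made up for illustration and are not results from the paper; the function 
names are ours. 

```python
def average(values):
    """Plain arithmetic mean over all programs."""
    return sum(values) / len(values)


def realistic_mean(values):
    """Arithmetic mean after dropping the single best and single worst program."""
    if len(values) < 3:
        raise ValueError("need at least three values")
    trimmed = sorted(values)[1:-1]
    return sum(trimmed) / len(trimmed)


# Hypothetical efficacies (%) with one outlier, as m88ksim is for SPEC95.
efficacies = [30.0, 32.0, 35.0, 80.0]
print(average(efficacies), realistic_mean(efficacies))
```

A large gap between the two values signals that one program dominates the 
average; a small gap indicates homogeneous behavior across the suite. 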

These results also show that predictability is higher for the MediaBench suite in all 
circumstances, and that the difference between the two benchmark sets is more 
prominent for small predictor costs, decreasing as cost grows. This comparatively 
high predictability of the MediaBench programs may be due to the following reasons. 
First, they have, on average, many more integer instructions and fewer load 
instructions than the SPEC95 programs (see tables 1 and 2); in fact, integer 
instructions are the most predictable instructions, as shown in [4]. Second, 
MediaBench applications exhibit more loop-intensive structures and more 
redundancy in the input data (images, voice, video...) than SPEC95 programs. 

Comparing the different predictors in the case of embedded processors, the 
hybrid predictor clearly exhibits the best balance between efficacy and cost 
and hence represents the most suitable choice. In the case of general-purpose 
processors, by contrast, the HP achieves results similar to those of the SP. 
Fig 6. Comparative results for MB and SPEC benchmarks as a function of 
predictor cost (Kbytes): a) SP, b) CBP, c) HP 



5 Performance Analysis 

From the previous section we can conclude that the MediaBench suite exhibits a 
higher value predictability than SPEC'95. However, to justify the use of the extra 
value prediction hardware, it is essential to prove that the processor performance is 
significantly improved. 

In this section we evaluate the achievable speedup from using value prediction in 
two typical processor architectures: a high-performance embedded processor, and a 
high-performance general-purpose processor. 



5.1 Machine Model 

A detailed description of all the hardware mechanisms involved in the value 
speculation technique is beyond the scope of the present work. We just want to briefly 
introduce the architecture employed in the timing simulations, which is explained in 
more detail in the Technical Report [16]. 

Our baseline architecture, shown in figure 7, is derived from the architecture used 
by the SimpleScalar Out-of-Order simulator [13]. This architecture is based on the 
Register Update Unit (RUU) [17], which is a scheme that unifies the instruction 
window, the rename logic, and the reorder buffer under the same structure. 










Silvia Del Pino et al. 




Fig. 7. Architecture Block Diagram 



Predictor Lookup. The value predictor is accessed in parallel with the instruction 
fetch using the addresses of the instructions fetched in each cycle, and it provides the 
predicted output values (if available) of these instructions. 

Scheduling Policy. The scheduling policy first issues the instructions with actual 
operands; instructions with predicted or speculative operands are issued later. 
Within each group, an oldest-instruction-first policy is used. Using this policy, 
speculative instructions are not issued while there are enough non-speculative 
instructions ready to execute, even if these non-speculative instructions are newer 
than the speculative ones. 
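
This selection rule amounts to a two-level sort, as the following sketch 
shows. The data structure and names are hypothetical and stand in for the 
simulator's ready queue. 

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadyInstr:
    seq: int            # program order; a smaller number means an older instruction
    speculative: bool   # True if any operand is predicted or speculative


def select_for_issue(ready, issue_width):
    """Pick up to issue_width instructions: non-speculative first, oldest first."""
    ordered = sorted(ready, key=lambda ins: (ins.speculative, ins.seq))
    return ordered[:issue_width]


ready = [
    ReadyInstr(seq=1, speculative=True),
    ReadyInstr(seq=5, speculative=False),
    ReadyInstr(seq=3, speculative=False),
    ReadyInstr(seq=2, speculative=True),
]
print([ins.seq for ins in select_for_issue(ready, issue_width=3)])
```

Instruction 1 is the oldest, but it is issued only after the two 
non-speculative instructions, and instruction 2 has to wait for a later cycle. 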

Validation and Misprediction Recovery. The process of validation/invalidation of 
speculative instructions is performed during write-back. This process is performed in 
parallel, i.e. all the instructions within a dependence chain can be 
validated/invalidated in a single cycle. The instructions whose operands have been 
validated can commit in the next stage. On the other hand, those instructions whose 
operands have been invalidated must be re-executed. Since it is not 
possible to check the validity and re-schedule the invalidated instructions in the same 
cycle, these instructions cannot be re-executed in the next cycle. 
Consequently, they are delayed one cycle in relation to normal execution. 

Baseline Architectures. Table 5 shows the main parameters of the two selected 
architectures: a 4-width embedded processor architecture and a 6-width 
general-purpose architecture. Most of the parameters of these architectures 
(fetch/decode width, issue width, instruction window and L1-cache size) have been 
taken from two highly evolved representative commercial processors: the AMD K6-2E 
embedded processor core [18], and the AMD Athlon general-purpose processor core 
[19] (notice that fetch/decode width refers to RISC instructions). Other 
parameters, like functional units, have been adapted to the SimpleScalar 
simulator, which does not support special instructions (like MMX or 3DNow!). 
Furthermore, since value prediction significantly increases the pressure on 
execution units, the number of functional units and memory ports has been slightly 
increased in order to avoid a bottleneck in the execution stage. 
Table 5. Architectural Parameters 

Configuration parameters      Embedded Processor   General-purpose Processor 
Fetch/decode width            4                    6 
Issue width                   6                    9 
Instruction window            24                   72 
Load Store Queue              12                   36 
# Integer ALU                 4                    6 
# Integer Multiplier          1                    2 
# Floating Point ALU          4                    6 
# Floating Point Multiplier   1                    2 
# Memory Ports                2                    3 
L1 I-Cache / L1 D-Cache       32KB / 32KB          64KB / 64KB 
L1 Latency                    1                    1 
L2 Cache Size                 No                   4MB 
L2 Latency                    -                    6 
Memory Latency                10                   10 



5.2 Comparative Results 

In the previous section we concluded that the hybrid predictor exhibits the best 
cost/efficacy trade-off. This observation, along with the fact that detailed timing 
simulations take a long time to execute, led us to use only the hybrid predictor to 
show performance results. 

Figure 8 shows the percentage of speedup achievable for both architectures 
(embedded and general purpose) and both benchmark suites (MediaBench and SPEC) 
using the hybrid predictor with various cost configurations (under 16 KBytes), and 
using a 2-bit saturating counter for confidence estimation with a confidence threshold 
equal to 3. Both the average and the realistic mean (eliminating the best and the worst 
cases) are displayed in this figure. 
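
The confidence mechanism named above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name and the update policy (add one on a correct prediction, subtract one on a wrong one) are assumptions, since this section only states the counter width (2 bits) and the threshold (3).

```python
class ConfidenceCounter:
    """2-bit saturating counter (values 0..3) gating the use of a predicted value."""

    def __init__(self, threshold=3):
        self.value = 0
        self.threshold = threshold

    def is_confident(self):
        # The predicted value is consumed only once the counter reaches the threshold.
        return self.value >= self.threshold

    def update(self, prediction_was_correct):
        # Assumed policy: +1 on a hit, -1 on a miss, saturating at 0 and 3.
        if prediction_was_correct:
            self.value = min(self.value + 1, 3)
        else:
            self.value = max(self.value - 1, 0)
```

With a threshold of 3, an instruction must be predicted correctly three times in a row from a cold start before its prediction is used, which trades coverage for a lower misprediction rate.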




[Figure 8, two panels: a) Embedded Processor, b) General Purpose Processor; horizontal axis: Predictor Cost (Kbytes)] 

Fig. 8. % Speedup achieved with the hybrid value predictor 



Two main conclusions can be drawn from this figure. First, the predictability 
results shown in the previous section have a direct equivalence in the performance 
results, since the speedup obtained for the MediaBench suite, for both architectures 
and all the predictor configurations, is higher than the speedup reached for the SPEC 
suite. Second, the difference between the average and the realistic mean curves for the 
SPEC benchmarks is much more prominent than the difference between the 
predictability curves shown in the previous section. Therefore, the sensitivity of the 
SPEC suite to the extreme cases has an even higher impact on speedup. This behavior 
is mainly due to the irregular results obtained for the m88ksim benchmark, which 
achieves a much higher speedup than the other benchmarks. On the other hand, 
MediaBench benchmarks exhibit a much more regular behavior, since the difference 
between the average and the realistic mean curves is of little significance. 
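
The effect of a single outlier on the two statistics can be illustrated with a short sketch. The function names and the speedup figures below are hypothetical, chosen only to mimic one benchmark that behaves very differently from the rest; they are not measurements from the paper.

```python
def average(values):
    return sum(values) / len(values)

def realistic_mean(values):
    """Mean after discarding one best (largest) and one worst (smallest) case."""
    if len(values) < 3:
        raise ValueError("need at least three values")
    trimmed = sorted(values)[1:-1]
    return sum(trimmed) / len(trimmed)

# Hypothetical speedups (%); the last entry stands in for an m88ksim-like outlier.
speedups = [4.0, 5.0, 6.0, 5.0, 30.0]
```

For this made-up list the plain average is pulled far above the typical value by the outlier, while the realistic mean stays close to it, which is the kind of gap between the two curves the text describes for the SPEC suite.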

We can also deduce that, although the general-purpose processor has a wider fetch 
(6 versus 4) and issue (9 versus 6), together with a larger window (72 versus 24), the 
speedup obtained with the SPEC'95 benchmarks is similar for both architectures and is 
only a little better with the MediaBench set for the general-purpose processor. This fact reveals, 
as shown in [16], that processors with a small to medium size instruction window can 
benefit more from the value prediction technique. The explanation of this effect is simple. 
With a small to medium window size, the number of independent instructions kept in 
the window is not enough to cover the available issue bandwidth; hence value 
prediction can be efficiently exploited because it allows data dependencies to be 
broken, and a good number of dependent instructions to be issued in parallel. 
However, as the window enlarges, the number of independent instructions kept in the 
window also increases, and hence value prediction becomes less useful, since it is 
easier to find enough independent instructions in the window to feed the issue 
bandwidth. In view of this fact, embedded processors can benefit more from value 
prediction than general-purpose processors, because they usually employ smaller 
windows due to area restrictions (24 and 72 entries, respectively, in our architectures). 

Figure 9 highlights the differences in the speedup achieved by using value 
prediction in the two habitual working situations: the embedded processor running 
MediaBench-like applications, and the general-purpose processor running SPEC-like 
applications. 




[Figure 9; horizontal axis: Predictor Cost (Kbytes)] 

Fig. 9. Speedup achievable in habitual working situations (realistic mean) 

These results show that the embedded processor achieves better speedup results 
than the general-purpose processor and, therefore, that applying the value prediction 
technique is more beneficial in this context, for two reasons. First, as we have shown 
throughout this paper, typical applications of embedded systems, like the MediaBench 
benchmarks, exhibit a higher value predictability than general-purpose applications. 
Second, as we have mentioned before, the architectures of embedded processors make 
best use of value prediction. 

[Figure 10, two panels: a) Embedded processor, b) General-purpose processor; legend: L1-cache x 2, Value Prediction (Hybrid)] 

Fig. 10. Speedup obtained by doubling the L1-cache and by using value prediction 

A common question about the use of value prediction is whether the 
extra prediction hardware could be better employed in other parts of the 
processor, where it could yield a higher benefit in overall performance (for 
example, by increasing the L1-cache size). With this idea in mind we performed some 
experiments whose results are displayed in Figure 10. This figure shows the speedup 
obtained by doubling the L1-cache (both the instruction and data caches) in the 
embedded processor and the general-purpose processor (both processors running the 
MediaBench benchmarks), compared with the speedup obtained by using a 14 
Kbyte hybrid value predictor. 

We can observe that, for both processors, the speedup achievable using value 
prediction is much higher than that obtained by increasing the cache size. This difference is more 
prominent in the general-purpose processor (when executing MediaBench), since 
increasing the cache scarcely affects performance. Furthermore, the cost of the 
prediction hardware (14 Kbytes) is much lower than the cost of doubling the L1-cache 
(64 Kbytes in the embedded processor and 128 Kbytes in the general-purpose 
processor). So, we can conclude that value prediction is a profitable hardware 
investment for processor performance. 



6 Conclusions and Future Work 

The objective of this work is to apply value prediction techniques in the domain of 
embedded processors and to demonstrate their higher efficiency within this scope. 

The main conclusions that can be drawn from this study are the following: 

• Our initial intuition was verified and we have demonstrated that multimedia and 
communication programs present a more highly predictable value behavior than 
normal programs. Furthermore, a high degree of predictability can be obtained 
using low-cost value predictors, and therefore employing value prediction seems 
reasonable for this particular kind of applications. 

• By means of detailed timing simulations, and using two generic high-performance 
architectures, one for an embedded processor and another for a general-purpose 
processor, we have shown that the higher predictability of multimedia and 
communication programs has a direct impact on the performance results, since the 
speedup obtained for the MediaBench suite, for both architectures and all the 
predictor configurations, is higher than the speedup attained for the SPEC suite. 

• In spite of the general-purpose processor having a wider fetch and issue, as well as 
a larger window, the speedup achievable using value prediction in an embedded 
environment is significantly higher. This is due to both the higher value 
predictability of multimedia and communication applications and the smaller 
instruction window used in embedded processors, which allows more efficient 
exploitation of value prediction. 

• Finally, we have shown that the speedup obtained by using a hybrid value 
predictor is appreciably higher than the speedup obtained by doubling the L1-cache. 
These results show that the hardware invested in value prediction is a 
beneficial expense for processor performance. 

Nevertheless, this work must be interpreted as a first step towards integrating value 
speculation into embedded processor architecture. We believe that there is 
considerable work to be carried out, especially in relation to performance/cost 
analysis, power-consumption considerations, and confidence estimation. Our future 
research will cover these issues, and also deepen the analysis of the hardware 
mechanisms involved in value speculation. 

Chapter 2: 

Cellular Automata and Applications in 
Computational Physics 



Introduction 



Dietrich Stauffer's invited talk, Cellular Automata: Applications, opens this 
chapter, which deals mainly with problems of interest to computational physics 
and comprises one invited talk and six selected papers. 

The first selected paper, by Talia, is a manifesto in support of languages for 
parallel programming; it is argued that Fortran or C combined with MPI are at a 
disadvantage when compared with higher-level languages in which the paradigms 
for cellular programming are embedded in the language and the parallelization 
directives are hidden from the user or programmer. 

Kundu, in the second paper, considers the utilization of genetic algorithms for 
identification and extraction of evolvable rules originated from cells in a lattice 
of sites in cellular automata models. 

Pacheco and Martins parallelized a program for the computation of, for 
instance, the total energy of a molecule based on Density-Functional 
theory; the structural optimization of large molecules is performed 
via a Monte-Carlo simulated annealing strategy. 

Borges and Falcao also rely on Monte Carlo simulation, here applied to the 
reliability evaluation of electric power systems; the work included the parallelization 
of an algorithm based on sampling the state space for several periods, simulating 
a realization of the stochastic process of system operation. 

Numerical simulation of plasma is the subject of the study by Nunn, including 
the parallelization of the so-called Vlasov Hybrid Simulation method; 
results are presented concerning the simulation of the generation mechanism 
of triggered emission and chorus in a frequency band between 3 and 30 kHz in 
the Earth's magnetosphere. 

Vigo-Aguiar et al. solve the radial Schrödinger equation using a parallel 
multistep algorithm; results obtained on a 4-processor machine show speed-ups of 3.1. 

Abstract. A review will be given on the simulation of large simple 
cellular automata with up to one million times one million elements. Here 
each site of a large lattice carries a binary variable the orientation of 
which depends on its lattice neighbours' orientation at the previous time 
step. Single-bit processing allows for high speeds (except for probabilistic 
rules) and saves memory. Geometric parallelization is easy since only a 
small amount of message passing at predictable times is required. Appli- 
cations will emphasize biology: Game of life, ageing, sex. 



1 Introduction 

Parallel computing for cellular automata and Ising-like systems has a 30-year-old 
history, dating from long before real parallel machines became widespread: each 
variable can be stored in a single bit, and logical bit-by-bit operations 
then deal with 32 variables simultaneously through one single 4-byte command. 
Many aspects of vector and parallel computing, like the division of a lattice 
into sublattices of checkerboard-type, were used in this way before they were 
used on vector computers. Physicists call this method multi-spin coding, and 
Prof. J.A.M.S. Duarte is a Porto expert. Since my last review [1] large parallel 
machines have become widespread, and different applications have also been found. 

Cellular automata are discrete in space, time, and values. Here we 
assume that each site i of a large lattice carries a variable n_i which is either +1 
or -1 (the spin language preferred by physicists), or 1 and 0 in the language of 
computer science, which is more appropriate for multi-spin coding. The value n_i 
at the next time step t + 1 is completely determined by that of its nearest lattice 
neighbours at time step t. 

The next section deals with multi-spin coding on a scalar machine, then we 
deal with domain decomposition of large lattices on parallel computers, and 
finally we summarize some applications with up to 10^12 sites. 



2 Multi-spin Coding 

Let us assume we want to study an infection process. Every site on a large lattice 
is either sick (n = 1) or healthy (n = 0). Sick sites infect their neighbours. Thus 
site i is sick at the next time step, n_i(t + 1) = 1, if at time t at least one of its 
neighbours was sick. On a one-dimensional chain the two neighbours of site i 
are i - 1 and i + 1, which means that a logical OR gives the desired result: 

nnew(i) = n(i-1) .or. n(i+1) 

when in Fortran n and nnew are logical arrays. Now many bits are wasted to 
store the one-bit variables nnew and n in one computer word. If we store 32 such 
variables in one computer word, we rewrite the above statement as 

nnew(i) = ior(n(i-1), n(i+1)) 

where ior is a bit-string command performing the logical-or operation for each 
pair of bits separately. Thus one command does 32 (or 64) operations in paral- 
lel on a scalar computer. In the programming language C the same bit-by-bit 
commands are part of the standard, using different symbols for the operations. 

If we stored sites 1 to 32 of the chain in the first 4-byte integer n(1), 
sites 33 to 64 in n(2), etc., then the above statement would not deal with the 
nearest neighbours of site i. Therefore, if we use LL = L/32 words n(1), n(2), ..., 
n(LL) for L sites, we store site 1 in the first word, site 2 in the second, ..., site 
LL in the last word, in the first bit position. Then sites LL + 1, LL + 2, ..., 2LL 
are stored in the second bit position of the same words n(1), n(2), ..., n(LL), then 
2LL + 1 to 3LL in the third bit position, until the last bit of the last word n(LL) 
is filled with site L. Then the above statement really works for 1 < i < LL; for the 
extreme words n(1) and n(LL) the left and right neighbours are n(LL) and n(1), 
respectively, shifted circularly to the left or right by one bit. In d dimensions, the 
integer array n needs a second index going from 1 to L^(d-1) as usual. Complete 
Fortran programs are given in [1]. 
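
The interleaved storage scheme and the word-wise OR can be sketched in Python as follows. This is an illustrative sketch, not the Fortran program of [1]: sites and words are numbered from 0 instead of 1, the function names are invented for this example, and Python integers are masked to 32 bits to imitate 4-byte words.

```python
W = 32                      # bits per word, as for the 4-byte integers in the text
MASK = (1 << W) - 1

def pack(sites):
    """Interleaved layout: site s goes to word s % LL, bit position s // LL."""
    LL = len(sites) // W
    words = [0] * LL
    for s, bit in enumerate(sites):
        words[s % LL] |= bit << (s // LL)
    return words

def unpack(words):
    LL = len(words)
    return [(words[s % LL] >> (s // LL)) & 1 for s in range(LL * W)]

def rotl(x):                # circular shift by one bit towards higher positions
    return ((x << 1) | (x >> (W - 1))) & MASK

def rotr(x):                # circular shift by one bit towards lower positions
    return ((x >> 1) | (x << (W - 1))) & MASK

def infect(words):
    """One step of nnew(i) = ior(n(i-1), n(i+1)) with periodic boundaries."""
    left = [rotl(words[-1])] + words[:-1]    # words holding the sites s-1
    right = words[1:] + [rotr(words[0])]     # words holding the sites s+1
    return [l | r for l, r in zip(left, right)]
```

Only the first and last words need the circular shift; every other word finds its neighbours at the same bit position in the adjacent words, so one OR per word updates 32 sites at once.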

In the analysis of simulated configurations it is very helpful to have a function 
computing the number of bits which are set. Fortunately, the late Seymour Cray 
was aware of that problem and gave us this function under the name popcnt. 
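
Where no such hardware instruction is at hand, the same count can be obtained in software. The sketch below uses the well-known trick of repeatedly clearing the lowest set bit; the helper `changed_sites` is a hypothetical name illustrating the typical use on two multi-spin coded configurations.

```python
def popcnt(word):
    """Number of set bits in a non-negative integer word."""
    count = 0
    while word:
        word &= word - 1    # clears the lowest set bit
        count += 1
    return count

def changed_sites(old_words, new_words):
    """Count the sites that differ between two multi-spin coded configurations."""
    return sum(popcnt(a ^ b) for a, b in zip(old_words, new_words))
```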

These programs are also vectorized and speeds of the order of 10® sites could 
be reached already a decade ago on one vector processor. The next section de- 
scribes the parallelization. 

3 Parallel Computers 

While multi-spin coding allows parallel treatment of 32 variables, we can get 
additional speed on parallel computers with many processors, distributed mem- 
ory, and message passing, by using simultaneously all these processors on one 
large lattice. (How to do different lattices on different processors by replication 
presumably does not have to be explained to this conference.) Since I do not 
have access to a multitude of such parallel machines, I just learned the machine- 
dependent message passing routines on the machine for which I had the account, 
and since 1996 this is a Cray-T3E with 64 bits per word. Message passing 
commands start with shmem. 

      do 3 itime=1,max
      info = barrier()
c     upper buffer line 0: last line of the strip on the previous node
      if (node.gt.0) call shmem_get(n(1,0),n(1,Lstrip),LL,node-1)
      if (node.eq.0) call shmem_get(n(1,0),n(1,Lstrip),LL,np-1)
      info = barrier()
      do 6 j=1,Lstrip
c     periodic boundaries left and right via circular shift
      n(0,j)=ior(ishft(n(LL,j),-1),ishft(n(LL,j),63))
    6 n(LL1,j)=ior(ishft(n(1,j),1),ishft(n(1,j),-63))
      nch=0
      do 7 j=1,Lstrip
      if (j.eq.2) then
        info = barrier()
c       lower buffer line Lp: first line of the strip on the next node
        if (node.lt.np-1) call shmem_get(n(1,Lp),n(1,1),LL,node+1)
        if (node.eq.np-1) call shmem_get(n(1,Lp),n(1,1),LL,0)
        info = barrier()
      end if
      do 7 i=1,LL
      n1=n(i,j-1)
      n2=n(i,j+1)
      n3=n(i-1,j)
      n4=n(i+1,j)
      n5=n(i,j)
      n12=ior(n1,n2)
c     majority rule, evaluated for all 64 bits of the word at once
      n(i,j)=ior(ior(iand(n1,iand(n2,n3)),iand(n5,
     1 iand(n4,ior(n12,n3)))),iand(ior(n5,n4),
     2 ior(iand(n12,n3),iand(n1,n2))))
    7 if (n(i,j).ne.n5) nch=nch+popcnt(ieor(n(i,j),n5))
      info = barrier()
c     node 0 sums the number of changed sites over all nodes
      if (node.eq.0) then
        do 8 iadd=1,np-1
        call shmem_get(idummy,nch,1,iadd)
    8   nch = nch + idummy
      endif
      info = barrier()
      call shmem_get(nch,nch,1,0)
      info = barrier()
      if (node.eq.0) print *, itime,nch
      if (nch.eq.0) goto 9
    3 continue
    9 continue



Here shmem_get(target, source, length, node) gets from the processor 
with number node the information starting there with the word source and 
extending over length words in total. It stores it in the memory of the 
current processor, with the first memory location called target. For example, call 
shmem_get(n(1,0),n(1,Lstrip),LL,node-1) gets from processor node-1 to 
the present node the LL words n(1,Lstrip) to n(LL,Lstrip) and stores them in 
the words n(1,0) to n(LL,0). 

We divide a large L x L square lattice into Np strips of length L and width 
Lstrip = L/Np. Each strip is stored on one processor; in addition, each processor 
stores the lowest lattice line of the strip on the upper processor, and the highest 
lattice line of the strip on the lower processor. These two buffers are updated after 
every iteration via the shmem_get command. Thus message passing happens at 
predictable times, and the amount of transferred information is much smaller 
than the total amount of stored information provided Lstrip >> 1. 
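
The strip decomposition and its two buffer lines can be emulated in a single Python process. This is only a sketch of the data movement, with no real message passing: the function names are invented for this example, a strip is a list of rows, and the periodic wrap-around between the first and last strip follows what the sample program does for node 0 and node np-1.

```python
def split_into_strips(lattice, num_procs):
    """Cut a list of L rows into num_procs strips of Lstrip = L / num_procs rows."""
    lstrip = len(lattice) // num_procs
    return [lattice[p * lstrip:(p + 1) * lstrip] for p in range(num_procs)]

def exchange_halos(strips):
    """Return, for each processor, its two buffer lines (upper, lower)."""
    num_procs = len(strips)
    halos = []
    for p in range(num_procs):
        upper = strips[(p - 1) % num_procs][-1]   # last line of the previous strip
        lower = strips[(p + 1) % num_procs][0]    # first line of the next strip
        halos.append((upper, lower))
    return halos

def communication_ratio(L, num_procs):
    """Lines fetched per iteration divided by lines stored, per processor."""
    lstrip = L // num_procs
    return 2.0 / lstrip      # two buffer lines versus Lstrip stored lines
```

The ratio 2/Lstrip makes the closing remark of the paragraph concrete: the wider the strip, the smaller the share of data that has to travel between processors.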

The sample program core simulates the Griffeath majority rule on the square 
lattice: a spin is flipped if and only if more than half of its four neighbours point 
in the opposite direction. Loop 7 mostly translates these words into logical 
statements for the bit-by-bit operations. To see if a stable configuration has been reached 
which will remain unchanged forever, we calculate the number nch of sites which 
have flipped. If this number, summed over all processors, is zero, then we can 
stop the iteration. This particular cellular automata rule [2] was selected since 
the simulation indeed comes to a stop after a moderate number of iterations, 
thus allowing the simulation of L = 10^6 with moderate computing time. (For 
simplicity, the program uses sequential instead of simultaneous updating. info 
= barrier() forces synchronization of all processors. There are more elegant 
ways than loop 8 to sum over all processors.) 
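
The logical expression in loop 7 can be checked against the verbal rule with a few lines of Python. The bitwise formula below is transcribed from the listing; the per-site reference function and both function names are additions for this sketch.

```python
def majority_bits(n1, n2, n3, n4, n5):
    """Bitwise Griffeath majority update; n1..n4 are neighbour words, n5 the centre."""
    n12 = n1 | n2
    return ((n1 & n2 & n3)
            | (n5 & n4 & (n12 | n3))
            | ((n5 | n4) & ((n12 & n3) | (n1 & n2))))

def majority_site(neighbours, centre):
    """Reference rule for one site: flip iff more than half of the four neighbours differ."""
    opposite = sum(1 for x in neighbours if x != centre)
    return 1 - centre if opposite > 2 else centre
```

Because the formula uses only AND and OR, applying it to whole words updates every bit position independently, so all 32 possible input combinations can be packed into five words and verified with a single call.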

4 Applications 

4.1 With Multi-spin Coding 

One of the most famous applications of cellular automata is the Frisch-Hasslacher- 
Pomeau lattice gas for hydrodynamics on a triangular lattice. However, in 
recent years the emphasis in this area seems to have shifted to the lattice 
Boltzmann equation, which goes beyond cellular automata and is thus not reviewed 
here [3]. 

Immunology was simulated with cellular automata, sometimes using the 
above methods for huge lattices; but no consensus is evident from the 
literature as to which automata rule is best; thus we refer to [4]. 

(In this immunological context, a vectorization technique was developed 
which allows one to write one Fortran program for general dimension, working e.g. 
on the square lattice, the simple cubic lattice, and the four- or five-dimensional 
hypercubic lattice, though not with one bit per spin. The number of neighbours 
in d dimensions is 2d, and an inner loop over such a small number of neighbours 
would be very inefficient. Thus the inner loop went through all lattice sites. In 
the loop body, for every direction one line was added which was executed if 
the dimensionality d was large enough. Such if-statements again normally are 
deadly for efficient vectorization. However, by declaring d as a fixed constant 
idim through parameter (idim=3), the if-conditions were evaluated at compile 
time and the loop was efficiently vectorized.) 

The Game of Life has fascinated many through its variety of configurations. 
It uses the 8 nearest and next nearest neighbours of the center site. If the center 
site is empty it becomes occupied at the next time step if and only if three 
neighbours are occupied; if the center is occupied it remains so for the next time 
step if and only if two or three of its eight neighbours are occupied. A multi-spin 
coding program was published by Gibbs [5] though Franco Bagnoli (priv.comm.) 
has a faster one. Fig. 1 shows how the final density of occupied sites depends on 
the initial density [5]. Large lattices confirmed the theoretically expected power 
laws here for both low and high densities. 
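
The rule just stated can be written down directly. The sketch below is a plain per-site reference version, not a multi-spin coded program such as the one in [5]; periodic boundaries and the function names are assumptions made for this example.

```python
def life_step(grid):
    """One synchronous Game of Life step on a square grid with periodic boundaries."""
    size = len(grid)
    new = [[0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            occupied = sum(grid[(i + di) % size][(j + dj) % size]
                           for di in (-1, 0, 1) for dj in (-1, 0, 1)
                           if (di, dj) != (0, 0))
            if grid[i][j] == 0:
                new[i][j] = 1 if occupied == 3 else 0        # birth
            else:
                new[i][j] = 1 if occupied in (2, 3) else 0   # survival
    return new

def density(grid):
    """Fraction of occupied sites."""
    return sum(map(sum, grid)) / (len(grid) ** 2)
```

Iterating `life_step` from a randomly occupied grid until `density` stops changing gives one point of a curve like the one in Fig. 1, although only for lattices far smaller than those of the cited simulations.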



[Figure 1: Game of Life up to 160,000 * 160,000, concentrations in percent; horizontal axis: initial concentration, 0 to 100] 

Fig. 1. Variation of equilibrium density in Game of Life with initial density if 
the sites initially are occupied randomly. [5] 



Much simpler are the Q2R cellular automata approximating the Ising model. 
Each spin flips if and only if it has as many up as down neighbours. Thus, if 
interpreted as an Ising magnet, the spin flips if and only if such a flip does not 
change the energy. We have here a reversible microcanonical and non ergodic 
algorithm which nevertheless numerically gives the correct spontaneous magne- 
tizations in two and three dimensions. The dynamical behavior, however, is not 
understood [6]. This algorithm is a special case of the Creutz demon method 
recently reviewed by Aktekin [7]; we merely let the size of the energy reservoir 
of the demons go to zero. Careri [8] pointed out a possible biological application 
of Q2R. 
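
A minimal Python sketch of Q2R illustrates the energy conservation mentioned above. It assumes the usual checkerboard update order (one sublattice after the other), which this text does not spell out, as well as periodic boundaries on an even-sized lattice; the function names are invented for the example.

```python
def q2r_sweep(spins):
    """One Q2R sweep on an even-sized periodic square lattice of 0/1 spins.

    The two checkerboard sublattices are updated one after the other, so the
    neighbours of a spin stay fixed while it is being updated.
    """
    size = len(spins)
    for colour in (0, 1):
        for i in range(size):
            for j in range(size):
                if (i + j) % 2 != colour:
                    continue
                up = (spins[(i - 1) % size][j] + spins[(i + 1) % size][j]
                      + spins[i][(j - 1) % size] + spins[i][(j + 1) % size])
                if up == 2:              # as many up as down neighbours
                    spins[i][j] ^= 1
    return spins

def energy(spins):
    """Number of unlike nearest-neighbour pairs (the Ising energy up to constants)."""
    size = len(spins)
    return sum((spins[i][j] != spins[(i + 1) % size][j])
               + (spins[i][j] != spins[i][(j + 1) % size])
               for i in range(size) for j in range(size))
```

A flipping spin has two like and two unlike neighbours both before and after the flip, so `energy` returns the same value after every sweep.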
4.2 Other Bit-Strings 

In the above applications, all bits within one word could be treated efficiently 
by multi-spin coding since they all played the same role and thus were treated 
in parallel. This is no longer the case when the position of a bit has a special 
meaning. In the Penna model of biological ageing [9] , the bit position corresponds 
to a “year” or other time unit in the life of the individual: A bit set at one year 
means that from this year on until the death of the individual a dangerous 
inherited disease affects the health; three or more such active diseases kill. Thus 
the bit-string in this Penna model symbolizes the survival aspects of the genome; 
the age of genetic death is stored in it from birth, but at present we know only 
those inherited diseases which have become active. (See the movie Gattaca on 
the question whether a future is desirable when we can interpret the human 
genome such that we know all these genetic defects long before they become 
active.) 

Thus each individual, characterized by a string of 32 bits, lives until three set 
bits kill it. Before that, it gives birth provided it has reached the minimum 
reproduction age; for each child a random mutation sets one of the bits to one compared 
with the bit-string of the parent. To avoid an exponential growth of the 
population, a Verhulst factor like in the logistic equation limits the population from 
above. Now we no longer can deal with the bit-string through multi-spin coding 
in the above parallel sense, since a bit for year 2 plays a different role than a bit 
for year 30. But the bit-handling techniques are useful for both methods. 
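
The asexual model described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the program used in the cited simulations: the minimum reproduction age of 8 is a hypothetical value, each adult is assumed to have one child per year, the Verhulst factor is assumed to act on every individual each year, and `rng` is any object offering `random()` and `randrange()`, for example an instance of `random.Random`.

```python
GENOME_BITS = 32
LETHAL = 3          # three or more active diseases kill
MIN_REPRO_AGE = 8   # hypothetical value; the text gives no number

def active_diseases(genome, age):
    """Count the bits set at positions 0..age, i.e. the diseases already active."""
    return bin(genome & ((1 << (age + 1)) - 1)).count("1")

def survives(genome, age):
    return age < GENOME_BITS and active_diseases(genome, age) < LETHAL

def mutate(genome, rng):
    """Child genome: the parent's bit-string with one randomly chosen bit set to one."""
    return genome | (1 << rng.randrange(GENOME_BITS))

def step(population, capacity, rng):
    """Advance every (genome, age) pair by one year, with a Verhulst factor."""
    verhulst = 1.0 - len(population) / capacity
    next_gen = []
    for genome, age in population:
        age += 1
        if not survives(genome, age) or rng.random() >= verhulst:
            continue
        next_gen.append((genome, age))
        if age >= MIN_REPRO_AGE:
            next_gen.append((mutate(genome, rng), 0))
    return next_gen
```

Here the position of a bit matters: the mask in `active_diseases` selects only the years already lived, which is why the word-parallel treatment of the previous subsection no longer applies while the bit-handling operations themselves remain useful.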

Numerous simulations of this model, as reviewed recently [10], gave agree- 
ment with the Gompertz law of a mortality function increasing exponentially 
with age, or with the lifestyle of the Pacific Salmon, which dies shortly after 
marriage. Very recently [11] it was pointed out that in experiments with flies one may 
have a genetically homogeneous population but still a Gompertz law whereas the 
Penna model would in this case predict all genetic deaths at the same age; this 
critique was combined with a more complicated model avoiding this disadvan- 
tage. 

For parallel computing it is not only easier but also better to simulate Np 
separate populations on Np processors in parallel, than to distribute one popu- 
lation among the different processors. In the latter case, after some time one of 
the processors, which happened to have the fittest ancestors, will carry all the 
individuals and the others none, if no load balancing [12] is made. 

The above algorithm refers to asexual cloning; sexual reproduction combines 
one female bit-string and one male bit-string to give the genome of the child. 
Compared to cloning, sexual reproduction has the immediate advantage of avoid- 
ing the dangers of a hereditary disease: If this disease is recessive, as are most 
mutations, and if only one of the two parents has it, then the child’s health is 
not affected by it. In other words, sexual reproduction as opposed to cloning 
allows for redundant information just like back-up diskettes. If an error gets into 
the hard disk (female genome), the diskette (male genome) still has the correct 
information. Similarly, repeated proofreading [13] avoids more errors than 
proofreading just once. One of the main successes of the sexual Penna model was 
an explanation why females (as opposed to Pacific Salmon) survive menopause 
and why menopause exists at all for mammals [10]. It explained why men live 
shorter lives than women, opposite to the situation with birds [14]. The simulations 
also warn of future disappointments with medical care [15]. 



[Figure 2: Sexual Penna model; b=8 no selection (top), b=4 no selection (bottom), b=4 with selection (middle); horizontal axis: time] 

Fig. 2. Initial simulation (top), population with half the birth rate to simulate 
sexual reproduction (bottom), and somewhat improved results if females select 
only healthy partners (middle). 



However, what about life-forms with two genomes but without sexual 
reproduction (also known as meiotic parthenogenesis)? The above arguments make in 
this case the transmission of genetic information as reliable as in the sexual case, 
while the men do not get pregnant and just eat the steaks and drink the port 
wine away from the mothers. Why do we men exist at all? As protection against 
parasites [16]? It helps little if after a thousand generations the greater genetic 
variety allows better adjustment to an environmental catastrophe when during 
the waiting time at each generation meiotic parthenogenesis wins by a factor of 
two compared to sexual reproduction [10,17]. Figure 2 shows with the highest 
population the meiotic parthenogenesis; then for sexual reproduction the birth 
rate is reduced by a factor of two to account for lazy men (lowest population), and 
finally, for the middle curve, we assume that females select only the healthier 
males (fewer mutations) as sexual partners. We see from the figure that female 
selection may help, but not enough to overcome the loss of half the births. Other 
explanations [18] seem needed. 

Thus, perhaps men are an error of nature: “When God created Adam, She 
was only trying out.” 








Dietrich Stauffer 



References 

1. Stauffer, D.: Computer simulations of cellular automata. J.Phys. A 24, 909 (1991); 
de Oliveira, P.M.C.: Computing Boolean Statistical Models, World Scientific, Singa- 
pore 1991 

2. Stauffer, D.: Simulation of Griffeath majority rule on large square lattice. Int. J. 
Mod. Phys. C 8, 1141 (1997) 

3. Boghosian, B. and Yeomans, J. : Proc. 7th Int. Conf. Discrete Simulation of Fluids, 
Oxford, July 1998, Int. J. Mod. Phys. C 9, 1123-1605 (1998) 

4. Lippert, K. and Behn, U.: Modelling the Immune System: Architecture and Dynam- 
ics of Idiotypic Networks, page 287 in: Annual Reviews of Computational Physics, 
Vol. V, World Scientific, Singapore 1997; Zorzenon dos Santos, R.M.: Immune Re- 
sponses: Getting close to experimental results with cellular automata models, ibid., 
Vol. VI, page 159 (1998) 

5. Gibbs, P. and Stauffer, D.: Search for Asymptotic Death in Game of Life. Int. J. 
Mod. Phys. C 8, 601 (1997); Malarz, K. et al: Some new facts of Life. Int. J. Mod. 
Phys. C 9, 449 (1998) 

6. Stauffer, D.: Critical 2D and 3D dynamics for q2r cellular automata. Int. J. Mod. 
Phys. C 8, 1263 (1997) 

7. Aktekin, N.: page 1 in: Annual Reviews of Computational Physics, Vol. VII, World 
Scientific, Singapore 2000. 

8. Careri, G. and Stauffer, D.: Ising cellular automata for proton diffusion on protein 
surfaces. Int. J. Mod. Phys. C 9, 675 (1998) 

9. Penna, T.J.P.: J. Statist. Phys. 78, 1629 (1995) 

10. Moss de Oliveira, S., de Oliveira, P.M.C, and Stauffer, D.: Evolution, Money, War 
and Computers, Teubner, Stuttgart-Leipzig, 1999 

11. Pletcher, S. and Neuhauser, G.: Biological aging - Criteria for modelling and a new 
mechanistic model. Int. J. Mod. Phys. C 11, 525 (2000) 

12. Meisgen, F.: Dynamic load balancing for simulations of biological aging. Int. J. 
Mod. Phys. C 8, 575 (1997) 

13. Morris, J.A.: Medical Hypotheses 49, 159 (1997) 

14. Paevskii, V.A.: Demography of Birds (in Russian), Nauka, Moscow 1985 

15. Niewczas, E., Cebrat, S., and Stauffer, D.: The influence of the medical care on 
the human life expectancy in 20th century and the Penna ageing model. Theory in 
Biosciences 119, 122 (2000). 

16. Hamilton, W.D., Axelrod, R., and Tanese, R.: Sexual reproduction as an adaption 
to resist parasites. Proc. Natl. Acad. Sci. USA 87, 3566 (1990); Howard, R.S. and 
Lively C.M.: Parasitism, mutation accumulation and the maintenance of sex. Nature 
367, 554 and 368, 358 (E) (1994); Sa Martins, J. S.: Simulated co-evolution in a 
mutating ecology. Phys. Rev. E 61, R 2212 (2000); Doncaster, C.P., Pound, G.E. 
and Cox, S.J.: The ecological cost of sex. Nature 404, 281 (2000). 

17. Stauffer, D.: Why care about sex ? Some Monte Carlo justification. Physica A 273, 
132 (1999) 

18. Orcal, B., Tuzel, E., Sevim, V., Jan, N. and Erzan A., Int. J. Mod. Phys. C 11, 
No. 5 (2000). 




The Role of Parallel Cellular Programming in 
Computational Science 



Domenico Talia 
ISI-CNR 

Via P. Bucci, cubo 41-C 
87036 Rende, CS, 
Italy 

talia@si.deis.unical.it 



Abstract. Cellular automata provide an abstract model of parallel com- 
putation that can be effectively used for modeling and simulation of com- 
plex phenomena and systems. The design and implementation of parallel 
languages based on cellular automata provide useful tools for the devel- 
opment of scalable algorithms and applications in computational science. 
We discuss here the use of cellular automata programming models and 
tools for parallel implementation of real-life problems in computational 
science. Cellular parallel programming tools allow for the exploitation 
of the inherent parallelism of cellular automata in the efficient 
implementation of natural solvers that simulate dynamical systems by a very 
large number of simple agents (cells) that interact locally. As a practi- 
cal example, the paper shows the design of parallel cellular programs by 
a language called CARPET and discusses other languages for parallel 
cellular programming. 



1 Introduction 

Cellular automata (CA) offer a computational model that, because of its simplicity 
and generality, has been utilized in many and disparate scientific areas such as 
fluid dynamics, artificial life, image processing, parallel computing, biology, 
economics and data encryption. The use of cellular automata has been widened by 
their implementation on high-performance parallel architectures that allowed for 
their use in solving very complex problems. Several languages and tools have 
been developed for programming cellular automata on sequential and parallel 
machines. They can support and improve the design and implementation of com- 
plex applications and systems using the cellular automata paradigm. This paper 
presents and discusses cellular automata programming languages and models for 
parallel implementation of real-life problems in computational science. 

A cellular automaton consists of a lattice of cells, each of which is connected 
to a finite neighborhood of cells that are nearby in the lattice [14]. Each cell in 
the regular spatial lattice can take any of a finite number of discrete state values. 
Time is discrete, as well, and at each time step all the cells in the lattice are 
updated by means of a local rule called transition function, which determines 
the cell's next state based upon the states of its neighbors. That is, the state of 
a cell at a given time depends only on its own state and the states of its nearby 
neighbors at the previous time step. Different lattice topologies (e.g., triangular, 
square, and hexagonal) and neighborhoods can be defined for an automaton. 
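
The definition above can be illustrated with a minimal Python sketch of one
synchronous update step. It is not taken from any of the languages discussed
in this paper: the function names, the four-cell neighborhood, the toroidal
wrap-around and the parity rule are all assumptions chosen only to make the
idea concrete.

```python
def step(lattice, rule):
    """Apply `rule` to every cell of a 2-D lattice in one synchronous step.

    Assumptions of this sketch: a square lattice stored as a list of lists,
    a four-cell (north, south, west, east) neighborhood, and toroidal
    wrap-around at the borders.
    """
    rows, cols = len(lattice), len(lattice[0])
    new = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            neigh = [
                lattice[(r - 1) % rows][c],  # north
                lattice[(r + 1) % rows][c],  # south
                lattice[r][(c - 1) % cols],  # west
                lattice[r][(c + 1) % cols],  # east
            ]
            # The next state depends only on the previous-step states.
            new[r][c] = rule(lattice[r][c], neigh)
    return new


def parity_rule(state, neigh):
    """Hypothetical transition function: parity of the neighbor states."""
    return sum(neigh) % 2


grid = [[0, 0, 0],
        [0, 1, 0],
        [0, 0, 0]]
next_grid = step(grid, parity_rule)
```

Because the new states are written to a separate lattice, every cell reads
only values from the previous time step, which is the property that makes the
update order irrelevant and the model easy to parallelize.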

Cellular automata provide a global framework for the implementation of par- 
allel programs that represent natural solvers of dynamic complex phenomena and 
systems based on the use of discrete time, discrete space and a discrete set of 
state variable values. CA are intrinsically parallel and they can be efficiently
mapped onto parallel computers, because the communication flow between
processors can be kept low. Inherent parallelism and restricted communication are
two key points for the efficient execution of CA on parallel computers. Applica- 
tions of CA are very broad, ranging from the simulation of artificial life, physical, 
biological and chemical phenomena to the modeling of engineering problems in 
many fields such as road traffic, image processing, and science of materials. In the 
past 20 years there has been a significant increase of research activities concern- 
ing both theoretical aspects and practical implementations and use of cellular 
automata as a model for complex dynamics [16] [12]. 

In the cellular programming approach, a cellular algorithm consists of the 
transition function of cells that compose the CA lattice. The transition function 
of each cell is executed in parallel, thus the global state of the entire
automaton is updated at each iteration. The same local rule is generally used for all
the cells (homogeneous cellular automata), but it is also possible to define some
cells with different transition functions (inhomogeneous cellular automata).

In general, traditional languages such as C, Pascal, C++ and Fortran are
used in sequential implementations of cellular automata simulations. When a 
parallel implementation is provided, these languages are typically used together 
with parallel toolkits such as MPI and PVM. An alternative to this conservative 
approach is to use CA languages that can express directly in their constructs 
the definition of CA lattices and cellular algorithms. Once the program is written,
a compiler translates these CA rules into a simulation program. This approach
has the programming advantage of offering high-level CA operations, and the same
CA description could possibly also be compiled onto different computers.

Our opinion is that it is necessary and very useful to develop high-level
languages and tools specifically designed to express the semantics of the cellular
automata computational model. In particular, the design and implementation of
parallel languages based on the cellular automata model provide high-level
programming tools for the development of natural solvers in computational science,
that is, scalable algorithms and applications based on a nature-inspired model
such as cellular automata. In recent years several CA-based languages have
been developed and used for designing computational science applications. This
paper discusses the role these languages may play in the parallel scientific
applications arena. Furthermore, as a case study of this approach, we show the
design of parallel cellular programs in the CARPET language and discuss other
languages for parallel cellular programming such as CDL, Parcel-1, CANL, and
Cellang. Because of space limits we cannot describe all the languages in detail;
therefore we outline their main features by discussing the paramount aspects of
the parallel cellular programming languages class.

The remainder of the paper is organized as follows. Section 2 introduces and 
discusses the main features of parallel cellular languages. Section 3 gives a brief 
description of the CARPET language and section 4 shows the use of CARPET 
for implementing scientific applications according to the cellular programming 
model; some figures are presented to show performance scalability. Finally, sec- 
tion 5 draws some conclusions. 



2 Languages for Parallel Cellular Computing 

The aim of the paper is to discuss how cellular programming languages can 
support users in the implementation of computational science applications. This
class of applications requires the use of high-performance computers to get results
in a reasonable amount of time. For this reason we restrict the discussion to
cellular languages that have been implemented on parallel computers.

For the implementation of CA on parallel computers two main approaches
can be used. One is to write programs that encode the CA rules in a
general-purpose parallel programming language such as HPF, HPC++, Linda or CILK,
or to use a high-level sequential language like C, Fortran or Java with one of
the low-level toolkits/libraries currently used to implement parallel applications
such as MPI, PVM, or OpenMP. This approach does not require a parallel
programmer to learn a new language syntax and programming techniques for
cellular programming. However, it is not easy to use for programmers who
are not experts in parallel programming, and the resulting programs consist of a large
number of instructions even when simple cellular models are implemented. The
other possibility is to use a high-level language specifically designed for CA, in
which it is possible to directly express the features and the rules of CA, and then
use a compiler to translate the CA code into a program executable on parallel
computers. This second approach has the advantage that it offers a programming
paradigm that is very close to the CA abstract model and that the same CA
description could possibly also be compiled into different code for various parallel
machines. Furthermore, in this approach parallelism is transparent to the user,
so programmers can concentrate on the specification of the model without
worrying about architecture-related issues. In summary, it leads to the writing of
software that expresses the cellular paradigm in a natural way, so programs
are simpler to read, change, and maintain. At the same time, the regularity
of computation and locality of communication allow CA programs to get good
performance and scalability on parallel architectures.

Several CA programming languages such as Cellang [3], CARPET [10], CDL 
[6], CANL [7], Parcel-1 [13], DEVS-C++ [18], and CEPROL [8], have been
designed for parallel cellular computing in the past years. These languages sup- 
port the definition of cellular algorithms and their execution on different classes 
of parallel computers. They share several features, such as the common
computational paradigm, and differ in others, for example in the
constructs used to specify details of a cellular automaton, of cell mapping and output
visualization [17]. Many real-world applications in science and engineering, such 
as lava-flow simulations, molecular gas simulation, landslide modeling, freeway 
traffic flow, 3-D rendering, soil bioremediation, biochemical solution modeling, 
and forest fire simulation, have been implemented by using these CA languages. 
Moreover, parallel CA languages can be used to implement a more general class
of fine-grained applications such as finite element methods, partial differential
equations and systolic algorithms.

Here we discuss the main features of those languages. In particular, we outline 
the following aspects that influence the way in which CA applications can be 
developed on high performance architectures: 

1. Programming approach, 

2. Cellular lattice declaration, 

3. Cell state definition and operations, 

4. Neighborhood declaration and use, 

5. Parallelism exploitation, 

6. Cellular automata mapping, and 

7. Output visualization.

By discussing these concepts we intend to illustrate how this class of languages 
can be effectively used to implement high-performance applications in science 
and engineering using the massively parallel cellular approach. 



2.1 Programming Approach 

When a programmer starts to design a parallel cellular program she/he must
define the structure of the lattice that represents the abstract model of a
computation in terms of cell-to-cell interaction patterns. Then she/he must concentrate
on the unit of computation, which is a single cell of the automaton. The
computation to be performed must be specified as the evolution rule (transition function)
of the cells that compose the lattice. Thus, unlike in other approaches, a
user does not specify a global algorithm that contains the program structure in
an explicit form. The global algorithm consists of all the transition functions of
all cells, which are executed in parallel for a certain number of iterations (steps).
It is worth noting that in some CA languages it is possible to define transition
functions that change in time and space to implement inhomogeneous CA
computations. Thus, after defining the dimension (e.g., 1-D, 2-D, 3-D) and the
size of the CA lattice, she/he needs to specify, using conventional and CA
statements, the transition function of the CA that will be executed by all the
cells. Then the global execution of the cellular program is performed as a
massively parallel computation in which implicit communication occurs only among
neighbor cells that access each other's state.



2.2 Cellular Lattice Declaration

As mentioned in the previous section, the lattice declaration defines the lattice 
dimension and the lattice size. Most languages support two-dimensional rect- 
angular lattices only (e.g., CANL and CDL). However, some of them, such as 
CARPET and Cellang, allow the definition of 1-D, 2-D, and 3-D lattices. Some 
languages also allow the explicit definition of boundary conditions. CANL [7],
for example, allows adiabatic boundary conditions, where absent neighbor cells are
assumed to have the same state as the center cell. Others implement reflecting
conditions that are based on mirroring the lattice at its borders. Most languages
use standard boundary conditions such as fixed and toroidal conditions.
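
The four kinds of boundary condition mentioned above can be compared in a
small Python sketch on a 1-D lattice. The function name and its parameters are
hypothetical, and the reflecting case assumes mirroring about the border cell
itself; other mirroring conventions exist and a given CA language may use a
different one.

```python
def neighbor_state(cells, i, offset, boundary, fixed_value=0):
    """State that cell `i` sees at position `i + offset` on a 1-D lattice.

    `boundary` selects what happens when the neighbor falls outside the
    lattice. This sketch assumes |offset| is smaller than the lattice size.
    """
    n = len(cells)
    j = i + offset
    if 0 <= j < n:
        return cells[j]              # ordinary interior neighbor
    if boundary == "toroidal":
        return cells[j % n]          # wrap around to the opposite side
    if boundary == "fixed":
        return fixed_value           # constant value outside the lattice
    if boundary == "adiabatic":
        return cells[i]              # absent neighbor copies the center cell
    if boundary == "reflecting":
        # Mirror about the border cell (assumed convention).
        return cells[-j] if j < 0 else cells[2 * (n - 1) - j]
    raise ValueError("unknown boundary condition: " + boundary)


cells = [10, 20, 30, 40]
left_of_first = {
    b: neighbor_state(cells, 0, -1, b)
    for b in ("toroidal", "fixed", "adiabatic", "reflecting")
}
```

Keeping the boundary policy in one place, as here, is also what lets a CA
language expose it as a declaration rather than as index arithmetic scattered
through the transition function.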



2.3 Cell State 

The cell state contains the values of data on which the cellular program works. 
Thus the global state of an automaton is defined by the collection of the state 
values of all the cells. While low-level implementations of CA define the
cell state as a small number of bits (typically 8 or 16 bits), cellular languages
such as CARPET, CANL, DEVS-C++ and CDL allow a user to define cell
states as a record of typed variables as follows:

cell = (direction : int;
        speed : float);

where two substates are declared for the cell state. According to this approach, 
the cell state can be composed of a set of sub-states that are of integer, real, 
char or boolean type, and in some cases (e.g., CARPET) arrays of those basic
types can also be used. Together with the constructs for cell state definition, CA
languages define statements for state addressing and updating that address the
sub-states by using their identifiers; for example, cell.direction indicates the
direction sub-state of the current cell.



2.4 Neighborhood 

An important feature of CA languages that differentiates them from array-based
languages and standard data-parallel languages is that they do not use
explicit array indexing. Instead, cells are addressed by name: the current cell or
the named cells belonging to its neighborhood. In fact, the neighborhood concept is
used in the CA setting to define interaction among cells in the lattice. In CA
languages the neighborhood defines the set of cells whose state can be used in the
evolution rule of the central cell. For example, if we use a simple neighborhood
composed of four cells we can declare it as follows



neigh cross = (up, down, left, right);

and address the neighbor cell states by the ids used in the above declaration
(e.g., down.speed, left.direction). The neighborhood abstraction is used to
define the communication pattern among cells. It means that at each time step,
a cell sends its state values to, and receives state values from, its neighbor cells. In this way
implicit communication and synchronization are realized in cellular computing.
The neighbor mechanism is a concept similar to the region construct that is used 
in the ZPL language [2] where regions replace explicit array indexing making 
the programming of vector- or matrix-based computations simpler and more 
concise. Furthermore, this way of addressing the lattice elements (cells) does
not require sophisticated compile-time analysis or complex run-time checks to
detect communication patterns among elements.
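
To make name-based addressing concrete, the following Python sketch mimics
the cell record of Section 2.3 and the cross neighborhood declared above. It
is only an illustration under stated assumptions: the class, the dictionary
of offsets, the helper function and the toroidal wrap-around are hypothetical
and do not reproduce the syntax or the run-time of any CA language.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """Cell state as a record of typed sub-states (direction, speed)."""
    direction: int
    speed: float


# Counterpart of "neigh cross = (up, down, left, right)": each name is bound
# once to a (row, column) offset, so rules never index the lattice directly.
CROSS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def neighbor(lattice, r, c, name):
    """Return the neighbor of cell (r, c) called `name` (toroidal lattice)."""
    dr, dc = CROSS[name]
    return lattice[(r + dr) % len(lattice)][(c + dc) % len(lattice[0])]


# 3x3 lattice whose speed sub-state encodes the cell position.
lattice = [[Cell(direction=0, speed=float(10 * r + c)) for c in range(3)]
           for r in range(3)]

down_speed = neighbor(lattice, 0, 1, "down").speed        # like down.speed
left_direction = neighbor(lattice, 0, 0, "left").direction  # like left.direction
```

Since the set of offsets is fixed by the declaration, the cells a rule can read
are known without analyzing index expressions, which is the point made above
about compile-time analysis.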

2.5 Parallelism Exploitation 

CA languages do not provide statements to express parallelism at the language 
level. It turns out that a user does not need to specify what portion of code must 
be executed in parallel. In fact, in parallel CA languages the unit of parallelism is 
a single cell and parallelism, like communication and synchronization, is implicit. 
This means that in principle the transition function of every cell is executed
in parallel with the transition functions of the other cells. In practice, when
coarse-grained parallel machines are used, the number of cells N is greater than
the number of available processors P, so each processor executes a block of N/P
cells that can be assigned to it using a domain decomposition approach.
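
A simple way to picture the N/P assignment is a block decomposition of a
1-D range of cells. The Python sketch below is an assumption about how such a
partition might be computed (the function name and the policy of giving the
remainder cells to the lowest-ranked processors are hypothetical); the
run-time systems of the languages discussed here may partition differently.

```python
def block_bounds(n_cells, n_procs, rank):
    """Half-open range [start, stop) of the cells owned by processor `rank`.

    Cells are split into contiguous blocks of about n_cells / n_procs; when
    the division is not exact, the first (n_cells % n_procs) processors get
    one extra cell each.
    """
    base, extra = divmod(n_cells, n_procs)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


# Ten cells on four processors.
blocks = [block_bounds(10, 4, p) for p in range(4)]
```

With contiguous blocks, only the cells on block borders need state from
another processor, which keeps communication proportional to the border size
rather than to the block size.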



2.6 CA Mapping 

Like parallelism and communication, data partitioning and process-to-processor
mapping are also implicit in CA languages. The mapping of cells (or blocks of
them) onto the physical processors that compose a parallel machine is generally
done by the run-time system of each particular language, and the user usually
intervenes only in selecting the number of processors or some other simple parameter.
Some systems that run on MIMD computers use load balancing techniques that
assign at run-time the execution of cell transition functions to processors that
are unloaded, or use greedy mapping techniques that prevent any processor from
remaining unloaded or free for a long period during the CA execution. Examples of
these techniques can be found in [15], [6] and [1].

2.7 Output Visualization and Monitoring 

A computational science application is not just an algorithm. Therefore it is not
sufficient to have a programming paradigm for implementing a complete
application. It is equally important to have environments and tools that
help a user in all the phases of the application development and execution. Most
of the CA languages we are discussing here provide a development environment
that allows a user not only to edit and compile the CA programs, but also
to monitor the program behavior during its execution on a parallel machine, by
visualizing the output as composed of the states of all cells. This is done by
displaying the numerical values or by associating colors to those values. Examples
of these parallel environments are CAMEL for CARPET, PECANS for CANL,
and DEVS for DEVS-C++. Some of these environments provide dynamical
visualization of simulations together with monitoring and tuning facilities. Users
can interact with the CA environment to change values of cell states, simulation
parameters and output visualization features. These facilities are very helpful
in the development of complex scientific applications and make it possible to use
those CA environments as real problem solving environments (PSEs) [4].



Here we addressed the most important aspects that concern the CA software 
development process from problem specification to execution and simulation tun- 
ing. In the next sections we use the CARPET language as a case-study language 
to describe in practice how cellular languages can support the development of 
computational science applications. 



3 CARPET: A High-Level Cellular Language 

CARPET implements the main CA features in a high-level programming
language that assists the design of parallel cellular algorithms without exposing
parallelism to the programmer [10]. In particular, CARPET has been used for programming cellular algorithms
in the CAMEL (Cellular Automata environMent for systEms ModeLing) parallel
environment [1]. CAMEL provides a software environment designed to support
the parallel execution of cellular algorithms, the visualization of the results, and 
the monitoring of the program execution. CARPET and CAMEL have been used 
for implementing high-performance simulations of lava flows, landslides, freeway 
traffic, and soil bioremediation [11]. 

The execution of cellular algorithms is implemented by the parallel execution 
of the transition function of every cell according to the Single Program Multiple 
Data (SPMD) model. In this way CAMEL exploits the computing power of a 
highly parallel computer, hiding the architecture issues from a user. A CARPET 
user can design cellular programs describing the actions of many simple active 
elements (implemented by the cells) interacting locally. Then, the CAMEL sys- 
tem allows a user to observe the global complex evolution that arises from all 
the local interactions. 

According to the SPMD programming approach, a user must define in
CARPET the transition function of a single cell of the system she/he wants to
simulate; the language run-time system then executes the transition
function in parallel to update the state of each cell at the same time. The main features
of CARPET are the possibility to describe the state of a cell as a record of
typed substates, each one of a user-defined type, and the simple definition of
complex neighborhoods (e.g., hexagonal), which can also be time dependent, in an
n-dimensional discrete Cartesian space.

With CARPET, a variety of cellular algorithms can be designed in a simple
but very expressive way. The language utilizes the control structures, the types,
the operators and the expressions of the C language and it enhances the
declaration part, allowing the declaration of the features of a cellular automaton. These
are the dimensions of the automaton (e.g., the declaration dimension 3; defines
a three dimensional automaton), the radius (radius) of the neighborhood
and the pattern of the neighborhood (neighbor). For example, a very simple
neighborhood composed of four cells can be defined as follows:

neighbor Stencil [4] ([1,0] Left, [-1,0] Right, [0,-1] Up, [0,1] Down); 

As mentioned before, the state (state) of a cell is defined as a set of typed 
substates that can be shorts, integers, floats, char, and doubles or arrays of 
these basic types. In the following example, the state consists of three substates. 

state (float speedX, speedY, energy); 

The energy substate of the current cell can be referenced by the predefined
variable cell_energy. The neighbor declaration assigns a name to specified
neighboring cells of the current cell and allows the program to refer to the value of
the substates of these identified cells by their name (e.g., Left_energy). Furthermore,
the name of a vector whose length is the number of elements composing the
logic neighborhood must be associated with the neighborhood (e.g., Stencil).
The name of the vector can be used as an alias in referring to the neighbor cells.
Through the vector, a substate can be referred to as Stencil[i]_energy where
0 ≤ i < 4.

To guarantee the semantics of cell updating in cellular automata the value of 
one substate of a cell can be modified only by the update operation, for example 

update (cell_speedX, 13.4);

After the execution of an update statement, the value of a substate argument 
remains unchanged in the current iteration. The new value takes effect at the 
beginning of next iteration. Furthermore, a set of global parameters (parameter) 
can be declared to define global characteristics of the system to be simulated 
(e.g., the permeability of a soil). Finally, CARPET allows users to define cells
with different transition functions (inhomogeneous CA) by means of the GetX,
GetY, GetZ functions that return the value of the coordinate X, Y, and Z of the
cell in the automaton. By varying only a coordinate it is possible, for example, 
to associate the same transition function to all cells belonging to a plane in a 
three dimensional automaton. 
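
The deferred effect of the update operation described above can be pictured
as a form of double buffering. The Python sketch below is only an analogy: the
class and its method names are hypothetical and say nothing about how the
CARPET run-time system is actually implemented.

```python
class Cell:
    """Cell whose sub-state updates become visible only at the next step."""

    def __init__(self, **substates):
        self.current = dict(substates)  # values readable in this iteration
        self.pending = {}               # values requested through update()

    def update(self, name, value):
        """Record a new value without changing what this iteration reads."""
        self.pending[name] = value

    def commit(self):
        """Called between iterations: pending values become current."""
        self.current.update(self.pending)
        self.pending.clear()


cell = Cell(speedX=0.0, speedY=0.0, energy=1.0)
cell.update("speedX", 13.4)
seen_this_step = cell.current["speedX"]   # still the old value
cell.commit()                             # beginning of the next iteration
seen_next_step = cell.current["speedX"]   # the new value is now visible
```

Separating the two copies is what guarantees that every cell computes its
next state from the same previous global state, regardless of the order in
which cells are processed.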

The language does not provide statements to configure the automata, to
visualize the cell values or to define data channels that can connect the cells
according to different topologies. The configuration of a cellular automaton is
defined by the graphical user interface (UI) of the CAMEL environment. The
UI allows the user, through pop-up menus, to define the size of the cellular automaton, the number
of processors on which the automaton must be executed, and to choose the
colors to be assigned to the cell substates to support the graphical visualization
of their values. Excluding constructs for configuration and data visualization
from the language makes it possible to execute the same CARPET program using
different configurations. Furthermore, it makes it possible to change from time to
time the size of the automaton and/or the number of processors on which
the automaton must be executed. Finally, this approach allows selecting the
most suitable range of colors for visualization of data.

4 Practical Examples of Cellular Programming 

In this section we describe two practical examples of cellular programming
written using the CARPET language. The first example is a typical CA application
that simulates excitable systems. The second program is the classical Jacobi
relaxation, which shows that CA languages can be used not only to simulate
complex systems and artificial life models, but also to implement
parallel programs in the area of fine-grained applications such as finite
element methods, partial differential equations and systolic algorithms that are
traditionally developed using array or data-parallel languages.

4.1 The Greenberg-Hastings Model

A classical model of excitable media was introduced in 1978 by Greenberg and
Hastings [5]. This model considers a two-dimensional square grid. Each cell is
in one of three states: resting (0), refractory (1), or excited (2). Neighbors are the
eight nearest cells. A cell in the resting state with at least s excited neighbors
(in the program we use s = 1) becomes excited itself, runs through all excited
and refractory states and returns finally to the resting state. A resting cell with fewer
than s excited neighbors stays in the resting state.
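
For readers who want to experiment with the rule outside a CA language, it
can be sketched sequentially in Python as follows. The state encoding follows
the text; the function name, the toroidal borders and the 5x5 test grid are
assumptions of this sketch, not part of the model definition.

```python
RESTING, REFRACTORY, EXCITED = 0, 1, 2

# Offsets of the eight nearest cells (Moore neighborhood).
MOORE = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]


def gh_step(grid, s=1):
    """One synchronous Greenberg-Hastings step on a toroidal square grid."""
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]
    for r in range(rows):
        for c in range(cols):
            state = grid[r][c]
            if state == EXCITED:
                new[r][c] = REFRACTORY
            elif state == REFRACTORY:
                new[r][c] = RESTING
            else:
                # Resting cell: count excited neighbors in the old grid.
                excited = sum(
                    grid[(r + dr) % rows][(c + dc) % cols] == EXCITED
                    for dr, dc in MOORE
                )
                if excited >= s:
                    new[r][c] = EXCITED
    return new


grid = [[RESTING] * 5 for _ in range(5)]
grid[2][2] = EXCITED              # a single excited cell in the center
after_one = gh_step(grid)
after_two = gh_step(after_one)
```

Starting from one excited cell, the excitation spreads outwards as a ring
followed by a refractory ring, which is the travelling-wave behavior the model
is meant to capture.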

Excitable media appear in several different situations. One example is nerve 
or muscle tissue, which can be in a resting state or in an excited state followed by 
a refractory (or recovering) state. This sequence appears for example in the heart 
muscle, where a wave of excitation travels through the heart at each heartbeat. 
Another example is a forest fire or an epidemic model where one looks at the
cells as infectious, immune, or susceptible.

Figure 1 shows the CARPET program that implements the two-dimensional
Greenberg-Hastings model. It appears concise and simple because the
programming level is very close to the model specification. If a Fortran+MPI or C+MPI
solution is adopted, the source code is much longer than this one
and, although it might be a little more efficient, it is much harder to program,
read and debug.

4.2 The Jacobi Relaxation

As a second example, we describe the four-point Jacobi relaxation on an n×n
lattice in which the value of each element is to be replaced by the average value

#define resting    0
#define refractory 1
#define excited    2

cadef
{
  dimension 2;
  radius 1;
  state (short value);
  neighbor Moore[8] ([0,-1] North, [1,-1] NorthEast, [1,0] East,
                     [1,1] SouthEast, [0,1] South, [-1,1] SouthWest,
                     [-1,0] West, [-1,-1] NorthWest);
}

int i, exc_neigh=0;
{
  /* look for at least one excited neighbor (s = 1) */
  for (i=0; (i<8) && (exc_neigh==0); i++)
    if (Moore[i]_value == excited) exc_neigh = 1;
  switch (cell_value)
  {
    case excited    : update (cell_value, refractory); break;
    case refractory : update (cell_value, resting); break;
    default : /* cell is in the resting state */
      if (exc_neigh == 1)
        update (cell_value, excited);
  }
}



Fig. 1. The Greenberg-Hastings model written in CARPET. 



of its four neighbor elements. The Jacobi relaxation is an iterative algorithm
that is used to solve systems of differential equations. It can be used, for example,
to compute the heat transfer in a metallic plate whose boundaries are held at a
given temperature. At each step of the relaxation the heat of each plate point
(cell) is updated by computing the average of its four nearest neighbor points.
Figure 2 shows a CARPET implementation. The initial if statement is used to 
set the initial values of cells that are taken to be 0.0 except for the western edge 
where boundary values are 1.0. 
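
The same relaxation step can be sketched sequentially in Python for
comparison. This is a simplified illustration under stated assumptions, not a
translation of the CARPET program: the function name is hypothetical, and the
border cells are simply held fixed, whereas in a CA language the treatment of
the borders depends on the boundary conditions chosen in the environment.

```python
def jacobi_step(plate):
    """One Jacobi relaxation step; border cells keep their values."""
    rows, cols = len(plate), len(plate[0])
    new = [row[:] for row in plate]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            # Average of the four nearest neighbors, read from the old plate.
            new[r][c] = (plate[r - 1][c] + plate[r + 1][c]
                         + plate[r][c - 1] + plate[r][c + 1]) / 4
    return new


# 4x4 plate: western edge at 1.0, everything else at 0.0.
plate = [[1.0] + [0.0] * 3 for _ in range(4)]
once = jacobi_step(plate)
twice = jacobi_step(once)
```

After one step only the interior cells next to the western edge have warmed
up; further steps propagate the heat one column at a time, which shows why the
computation needs only nearest-neighbor communication.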

The Jacobi program, although it is a simple algorithm, is another example
of how a CA language can be effectively used to implement scientific programs
that are not strictly within the original area of cellular automata. This simple case
illustrates the high-level features of the CA languages that can also be used to
implement applications that are based on the manipulation of arrays such as
systolic algorithms and finite element methods.

For the Jacobi algorithm we present some performance benchmarks that have
been obtained by executing the CARPET program using different grid sizes and
processor numbers. Table 1 shows the execution times for 100 relaxation steps

cadef
{
  dimension 2;
  radius 1;
  state (float elem);
  neighbor Neum[4] ([0,-1] North, [-1,0] West, [0,1] South, [1,0] East);
}

float sum;
{
  if (step == 1)
    /* first step: set the initial values */
    if (GetY == 1)
      update (cell_elem, 1.0);
    else
      update (cell_elem, 0.0);
  else
  {
    sum = North_elem+South_elem+East_elem+West_elem;
    update (cell_elem, sum/4);
  }
}



Fig. 2. The Jacobi iteration program written in CARPET. 



for three different grid sizes (100x200, 200x200 and 200x400) on 1, 2, 4, 8 and
10 processors of a QSW CS-2 multicomputer. From the table we can see that
as the number of processors used increases, there is a corresponding decrease
in the execution time. This trend is more evident when larger grids are used,
while smaller CA do not use the processors efficiently. This means that, because
of the simplicity of the algorithm, when we run an automaton with a small number
of cells we do not need to use several processing elements. On the contrary,
when the number of cells in the lattice is high, the algorithm benefits from the
use of a higher number of computing resources. This can also be deduced from
Table 2, which shows the relative speed up results for the three different grids. In
particular, we can observe that when a 200x400 lattice of cells is used we obtain
a superlinear speed up in comparison to the sequential execution, mainly because
of memory allocation and management problems that occur when all the 80,000
cells are allocated on one single processing element.



Table 1. Execution time (in sec.) of 100 iterations for the Jacobi algorithm

Grid Sizes   1 Proc   2 Procs   4 Procs   8 Procs   10 Procs
100x200      1.21     0.65      0.37                0.25
200x200      3.62     1.25      0.67      0.42      0.37
200x400      8.22     3.65      1.26      0.74      0.62



Table 2. Relative speed up of the Jacobi algorithm

Grid Sizes   1 Proc   2 Procs   4 Procs   8 Procs   10 Procs
100x200      1        1.86      3.27                4.84
200x200      1        2.89      5.40      8.62      9.78
200x400      1        2.25      6.52      11.10     13.25
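
The relative speed up values are the one-processor time divided by the
P-processor time, and dividing again by P gives the parallel efficiency. The
short Python sketch below recomputes both from the execution times of Table 1;
the function names are ours, and the 8-processor time for the 100x200 grid is
left out because it is missing from the table as reproduced here.

```python
# Execution times in seconds, copied from Table 1 (processors -> time).
times = {
    "100x200": {1: 1.21, 2: 0.65, 4: 0.37, 10: 0.25},
    "200x200": {1: 3.62, 2: 1.25, 4: 0.67, 8: 0.42, 10: 0.37},
    "200x400": {1: 8.22, 2: 3.65, 4: 1.26, 8: 0.74, 10: 0.62},
}


def speedup(grid, procs):
    """Relative speed up: one-processor time over `procs`-processor time."""
    return times[grid][1] / times[grid][procs]


def efficiency(grid, procs):
    """Speed up per processor; values above 1 indicate superlinear speed up."""
    return speedup(grid, procs) / procs


small_grid_eff = efficiency("100x200", 10)
large_grid_eff = efficiency("200x400", 10)
```

On 10 processors the efficiency is below 0.5 for the smallest grid and above
1 for the largest one, which is the contrast discussed in the text.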



5 Conclusions 

The primary function of programming languages and tools has always been to 
make the programmer more effective. Appropriate programming languages and 
tools may drastically reduce the costs for building new applications as well as 
for maintaining existing ones. It is well known that programming languages can
greatly increase programmer productivity by allowing the programmer to write
highly scalable, generic, readable and maintainable code. Also, new domain-specific
languages, such as CA languages, can be used to enhance different aspects
of software engineering. The development of these languages is itself a significant
software engineering task, requiring a considerable investment of time and
resources. Domain-specific languages have been used in various domains and the
outcomes have clearly illustrated the advantages of domain-specific languages
over general-purpose languages in areas such as productivity, reliability, and
flexibility.

The main goal of the paper is answering the following question: How does 
one program cellular automata on parallel computers? We think that it is very 
important for an effective use of cellular automata for computational science 
on parallel architectures to develop and use high-level programming languages 
and tools that are based on the cellular computation paradigm. These languages 
may provide a powerful instrument for scientists and engineers that need to 
implement real-life applications on parallel machines using a fine-grain approach. 
This approach allows designers to concentrate on "how to model a problem"
rather than on architectural details as occurs when people use low-level languages 
that have not been specifically designed to express fine-grained parallel cellular 
computations. 

In a sense, parallel cellular languages provide a high-level paradigm for fine- 
grain computer modeling and simulation. While efforts in sequential computer 
languages design focused on how to express sequential objects and operations, 
here the focus is on finding out what parallel cellular objects and operations are 
the ones we should want to define [9]. Parallel cellular programming is emerging
as a response to these needs. 

After discussing the main issues in programming scientific applications by
means of parallel cellular languages, we discussed the CARPET language as an
example in this class of languages. Using CARPET we described the implementation
of two application examples that illustrate the main features of this approach.

Currently CARPET and the latest version of its run-time system, named
CAMELot (CAMEL open technology), are used for the implementation of models
and simulation of complex phenomena, and they are available on parallel
architectures and cluster computing systems that use Sun Solaris, SGI IRIX,
Red Hat Linux and Tru64 UNIX 4.0F as operating systems.



Abstract. Complexity Engineering deals with harnessing the power of simple
models such as Cellular Automata (CA) to solve difficult and complex real-life
engineering problems, dealing with systems that have very simple components
that collectively exhibit complex behaviors. Cellular Automata (CA) are 
examples of dynamical systems which may instead exhibit "self organizing" 
behavior with increasing time. CAs are commonly used in modeling modular 
systems. An important aspect of modularity in engineering systems is the 
abstraction it makes possible. Once the construction of a particular module has 
been completed, the module can be treated as a single object, and only its 
behavior need be considered, wherever the module appears. One such 
application of modularity is described in this paper where a structural plate is 
considered as composed of smaller “structural modules” which are considered 
as cells in a lattice of sites in a CA and have discrete values updated in discrete 
time steps according to local rules. These local rules are generally fixed in a 
CA, but we consider these rules as evolvable. To evolve the local rules, we use 
the Genetic Algorithm (GA) model. Though the application described here is 
simple, it will still serve to demonstrate that the GA can discover CA rules that 
give rise to emergent computational strategies by self-organization, to exhibit 
globally coordinated tasks in optimization by simple local interactions only. 



1 Introduction 

In conventional engineering, systems are built to achieve very specific goals by 
exhibiting specific "global" behavior. Even the behavior of each of their component 
“local” parts is strictly designed and these have only specific reasons for their 
existence. The overall behavior of these systems must be simple enough that
complete prediction, and often analysis as well, is possible. Thus, for example, motion in
conventional mechanical engineering devices is usually constrained to be periodic. Of
course, more complex behavior could be realized or expected from the basic
components of a mechanical engineering device, but the principles necessary to make use
of such behaviors, and the theory necessary to analyze them, are not yet known.
In contrast, nature provides many examples of systems whose basic components
are simple, but whose overall behavior is extremely complex. Mathematical models
such as Cellular Automata (CA) capture the essential features of such "bottom-up"
complex systems. Complexity Engineering deals with harnessing the power of CA-like
simple models to solve difficult real-life engineering problems, dealing with systems
that have very simple components that collectively exhibit complex behaviors.

J.M.L.M. Palma et al. (Eds.): VECPAR 2000, LNCS 1981, pp. 221-229, 2001. 

Generally speaking, discrete dynamical systems that follow the second law of
thermodynamics evolve with time to maximal entropy and complete disorder. But
Cellular Automata are examples of dynamical systems which may instead exhibit
"self organizing" behavior with increasing time. Even starting from complete disorder,
their irreversible evolution can spontaneously generate ordered structures. Sometimes
even a decrease of entropy with time is observed as a result of self-organization. This
paper introduces the idea of using genetic (GA-based) learning of a Cellular Automaton
(CA) that takes a disordered structural layout to an ordered one, satisfying several
conflicting design criteria and finally producing an optimal or acceptable
structure/design. To evolve the Cellular Automaton (its rules) we use a Genetic
Algorithm (GA), which is a widely accepted computational framework for evolution.
The GA encodes the CA rules and evolves them progressively with time to more 
efficient ones (culling the less efficient ones in the evolution process). Detailed 
computer simulation studies have been performed and results are presented in this 
paper. Though the application described here is simple, it serves to demonstrate that 
the GAs can "discover" CAs that give rise to emergent computational strategies and 
exhibit global coordination tasks in structural optimization of complex engineering 
systems by simple local interactions only. 



2 Self Organization 

The second law of thermodynamics implies that isolated microscopically reversible 
physical systems tend with time to states of maximal entropy and maximal “disorder”. 
However “dissipative” systems involving microscopic irreversibility may evolve from 
“disordered” to more “ordered” states. This phenomenon of evolving from 
“disordered” to more “ordered” states can be seen in dynamically stable systems like 
Cellular Automata (CA)[1]. An elementary CA is a single array of “cells” capable of 
transforming themselves from one discrete “state” to some other. A certain definite set 
of interaction rules, governing how cells change their states in accordance to the value 
of the state of the neighboring cells, is given to each cell. With simple initial 
configurations, a CA either tends to homogeneous states or generates self-similar
patterns with fractal dimensions ≈ 1.59 or ≈ 1.69. With random initial configurations,
the irreversible character of the cellular automaton evolution leads to several self- 
organization phenomena. Statistical properties of the structure generated are found to 
lie in two universality classes, independent of the details of the initial state or the CA 
rules. This paper shows how the CA model can be used for its self-organizing 
capabilities for design optimization of structural plates and shells. The design goal is 
to find a minimum weight structure. Elements of this structure “adapt” to a minimum 
weight design, using the rules of the CA to transform their states (cell thickness). 
Rules are then subjected to genetic evolution by using a GA. A complex system is 
simulated with the interacting structural sub-elements being encoded as the cells of the 
CA. Sequential gathering of information on the interaction of structural sub-elements 
progressively modifies the different elements (variables) of the structure and the 
overall system evolves with time. This means that the rules of the CA (used for
interaction between the structural sub-elements) are progressively modified by the
genetic recombination (crossover and mutation), as these rules are directly encoded in
the genotype string of the GA. Complex adaptive systems theory is closely connected with
structural optimization, both large and small scale, because both exhibit global
behavior as a result of local action-reaction patterns, but this connection has not so far been
exhaustively studied or experimented with. However, some preliminary
research has been done in the area of evolution of simple CA models [2].



3 Structural Optimization by Evolving a Cellular Automata 

An evolutionary CA model has been applied here to structural optimization by
combining the strengths of CA and a Genetic Algorithm (GA) [3], to find local rules of
a CA which can minimize the weight of a plate structure. The plate is subjected to an
external load, which may be distributed or applied at a point. The plate is a 50 mm × 50 mm
square with a fixed left edge and a concentrated load on the right edge. This plate is
divided into 25 unit square elements and a discrete set of plate thicknesses is defined for
each of these elements. The Cellular Automaton encodes the configuration of this plate,
with the states of the cells in the CA representing the discrete set of plate thicknesses,
each cell changing from one thickness to another as a result of interaction with its neighboring
elements as defined by the exhaustive set of CA rules. These rules are then subjected
to evolutionary improvement by undergoing the Genetic Algorithm iterative cycle.
The CA lattice starts out with an Initial Configuration (IC) of cell states (0s and 1s)
and this configuration changes in discrete time steps in which all cells are updated
simultaneously according to the CA rules. A table of these rules is encoded in the GA
chromosome string and they evolve over time to give a rich set of rules that can take
any random starting IC of a CA to a desired final configuration.

3.1 Cellular Automata 

The structure of a system need not be complicated for its behavior to be highly 
complex, corresponding to a complicated computation. Computational irreducibility 
may thus be found even among systems with simple construction. Cellular Automata 
(CA) provide such an example [1]. A CA consists of a lattice of cells, each with k 
possible values, and each updated in time steps by a deterministic rule, depending on 
the neighborhood of R sites. Cellular automata are thus mathematical idealizations of 
physical systems in which space and time are discrete, and physical quantities take on 
a finite set of discrete values. A cellular automaton typically consists of a regular 
uniform array of cells. The state of the cellular automaton is completely specified by 
the values of the variables at each of these cells. The cellular automaton evolves in 
discrete time steps, with the value of the variable at one cell being affected by the 
values of the variables at sites in its “neighborhood” on the previous time step. The 
neighborhood of a cell is typically taken to be the cell itself and all immediately 
adjacent cells. The variables at each cell are updated simultaneously 
(“synchronously”), based on the values of the variables in their neighborhood at the 
preceding time step, and according to a definite set of local rules [1]. 

A one-dimensional cellular automaton is a lattice of N two-state machines (“cells”),
each of which changes its state as a function only of the current states in a local
neighborhood. The lattice starts out with an initial configuration (IC) of cell states (0s
and 1s) and this configuration changes in discrete time steps in which all cells are
updated simultaneously according to the CA “rule”. Here the term “state” is used to
refer to the value of a single cell. The term “configuration” is used to refer to the
collection of local states over the entire lattice. A CA's rule can be expressed as a
lookup table (“rule table”) that lists, for each local neighborhood, the state which is
taken on by the neighborhood's central cell at the next time step. For a binary-state
CA, these update states are referred to as the “output bits” of the rule table. In a
one-dimensional CA, a neighborhood consists of a cell and its r (“radius”) neighbors on
either side. The CA implemented in our model does not have periodic boundary
conditions, in which the lattice is viewed as being circular; instead, special rules are
defined for the edges of the lattice (boundaries).
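
The rule-table lookup described above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function names are hypothetical, the output bits are the eight listed in Figure 1, and the edge treatment (cells beyond the lattice are read as a fixed value) is an assumption standing in for the special boundary rules, which the text does not spell out.

```python
from itertools import product


def make_rule_table(output_bits, radius=1):
    """Map each neighborhood (a tuple of 0/1) to its output bit.

    Neighborhoods are enumerated in lexicographic order: 000, 001, ..., 111.
    """
    neighborhoods = list(product((0, 1), repeat=2 * radius + 1))
    if len(output_bits) != len(neighborhoods):
        raise ValueError("need exactly one output bit per neighborhood")
    return dict(zip(neighborhoods, output_bits))


def step(config, rule_table, radius=1, boundary=0):
    """Update all cells synchronously from the previous configuration.

    Assumption: cells outside the lattice are read as `boundary`
    (non-periodic lattice with a fixed edge value).
    """
    n = len(config)
    padded = [boundary] * radius + list(config) + [boundary] * radius
    width = 2 * radius + 1
    return [rule_table[tuple(padded[i:i + width])] for i in range(n)]


# Output bits of Figure 1 for neighborhoods 000 ... 111.
table = make_rule_table([0, 0, 0, 1, 0, 1, 1, 1])
ic = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1]  # an arbitrary initial configuration, N = 11
nxt = step(ic, table)
```

Because the new configuration is built entirely from the padded copy of the old one, every cell sees the state of its neighbors at the previous time step, which is what synchronous updating requires.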

Cellular automata have been studied extensively as mathematical objects, as models 
of natural systems, and as architectures for fast, reliable parallel computation. 
However, the difficulty of understanding the emergent behavior of CAs or of
designing CAs to have desired behavior has up to now severely limited their use in
science and engineering, for general computation and of course for optimization. The
work described here is on using genetic algorithms to obtain certain optimal CAs to
perform computations for optimization tasks. Typically, a CA performing a
computation means that the input to the computation is encoded as the Initial
Configuration (IC), the output is decoded from the configuration reached at some later
time-step, and the intermediate steps that transform the input to the output are taken as
the steps in the computation. The computation emerges from the CA rule being
obeyed by each cell.

To produce CAs that can perform sophisticated parallel computations, the Genetic
Algorithm (GA) must search for CAs in which the actions of the cells, taken together,
are coordinated so as to produce the desired behavior. This coordination must, of
course, happen in the absence of any central processor or central memory directing the
coordination. Some early work on evolving CAs with GAs was done by Packard [4].
Koza [5] also applied genetic programming to evolve CAs for simple random-number
generation. In this work, we have used a form of the GA to evolve one-dimensional,
binary-state r = 2 CAs to perform a structural optimization task.

3.2 Genetic Algorithms 

Genetic Algorithms [3] are distinguished by their simultaneous parallel investigation of
several areas of a search space, through manipulation of a population whose members
are coded problem solutions. The task environment for these applications is
modeled as an evaluation function, in most cases called a fitness
function, that maps an individual of the population into a real scalar. The motivating
idea behind the GA is natural selection. Genetic operators like selection, crossover and
mutation are implemented to emulate the process of natural evolution. A population of
“organisms” (usually represented as bit strings) is modified by the probabilistic
application of the genetic operators from one generation to the next. GAs also have
potential for multi-dimensional optimization, as they work with a population of solutions
rather than a single solution. A detailed explanation of the theory and working of the
GA can be found in the existing literature on the subject, for
example in Goldberg [3].



4 GA Encoding of the CA Rules 

The principal difficulty in this model is to find a suitable encoding technique for each
GA genotype string (chromosome or bit string), which has to represent a candidate
rule set (or rule table). We propose a new encoding technique here. A CA's rule
can be expressed as a rule table that lists, for each local neighborhood, the
state which is taken on by the neighborhood's central cell at the next time step. This is
illustrated in Figure 1. For a binary-state CA, these update states are referred to as the
“output bits” of the rule table, as shown in Figure 2. For all possible permutations of
neighborhood values (r = 1), we first define the output bit.

Rule table:

neighborhood: 000 001 010 011 100 101 110 111
output bit:     0   0   0   1   0   1   1   1

(encoded as GA genotype)

Fig 1: Rule table representation technique in a one-dimensional, binary-state,
nearest-neighbor (r = 1) cellular automaton with N = 11.

Fig 2: Output bit of each rule is encoded as the “genes” in

Each GA chromosome (bit string) consists of the output bits, so that the whole set of local
rules (rule table) is encoded as the chromosome. Thus each population member
(chromosome) of the GA represents a candidate rule table in our model. We consider
a unit-length neighborhood of cells in the four cardinal directions, which affect the state of
each cell, as shown in Figure 3. Thus we have four cells that affect each cell, plus the
cell itself. These five cells have 2 states each, making 32 neighborhood configurations in all
(2^5 = 32), and consequently 32 basic rules. We then have special rules for the boundary
elements, in two cases.

Case 1. Referring to Figure 4, we have one generalized rule for the corner elements
(1, 2, 3, 4) and one generalized rule for the group of edge elements (5, 6, 7, 8), with the
32 rules for elements marked 9. This makes 34 rules in all that are evolved by the GA.

Case 2. Referring to Figure 4, we have four special rules, one for each of the corner elements
(1, 2, 3, 4), and four special rules, one for each group of edge elements (5, 6, 7, 8), with the
32 rules for elements marked 9. This makes 40 rules in all that are evolved by the
GA.
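
As a rough illustration of this encoding, the sketch below treats a chromosome as a list of 32 output bits, one for each configuration of the five-cell neighborhood, and applies it to a 5 × 5 lattice. The bit ordering (north, west, center, east, south), the function names, and the treatment of missing neighbors as a fixed 0 are all assumptions made for the example; the special corner and edge rules of Case 1 and Case 2 are only counted here, not implemented, because the text does not give their form.

```python
def neighborhood_index(grid, r, c, outside=0):
    """Pack (north, west, center, east, south) into a 5-bit index in 0..31.

    Assumption: neighbors outside the plate are read as `outside`.
    """
    rows, cols = len(grid), len(grid[0])

    def cell(i, j):
        return grid[i][j] if 0 <= i < rows and 0 <= j < cols else outside

    bits = (cell(r - 1, c), cell(r, c - 1), cell(r, c), cell(r, c + 1), cell(r + 1, c))
    index = 0
    for b in bits:
        index = (index << 1) | b
    return index


def ca_step(grid, chromosome):
    """Apply the 32 output bits in `chromosome` to every cell simultaneously."""
    return [[chromosome[neighborhood_index(grid, r, c)] for c in range(len(grid[0]))]
            for r in range(len(grid))]


# Rule counts quoted in the text.
BASIC_RULES = 2 ** 5                # five binary cells -> 32 neighborhoods
CASE_1_RULES = BASIC_RULES + 1 + 1  # plus one corner rule and one edge rule
CASE_2_RULES = BASIC_RULES + 4 + 4  # plus four corner rules and four edge rules

# A hypothetical chromosome: output 1 when at least three of the five cells are 1.
majority = [1 if bin(i).count("1") >= 3 else 0 for i in range(32)]
plate = [[1, 0, 1, 1, 0],
         [0, 1, 1, 0, 0],
         [1, 1, 0, 1, 1],
         [0, 0, 1, 1, 0],
         [1, 0, 0, 1, 1]]
next_plate = ca_step(plate, majority)
```

In a GA, each population member would be one such bit list, and crossover and mutation would act directly on the 32 (or 34, or 40) positions.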

Fig 3: Neighborhood of each cell in the four directions that are considered
while encoding the CA computations in the GA genotype.

Fig 4: Special CA rules for the edges and for the corners of the structural plate.



5 Numerical Application 

The CA evolves through a series of transformations of its states, which change in
discrete time, given the local rule set for each of Case 1 and Case 2 described above.
We limit the number of CA transformations to 25 and then analyze the plate structure (with
variable material distribution throughout its 25 elements) for stresses, using a Finite
Element Analysis (FEA) program. The von Mises stress of each element is calculated
by FEA by decomposing each square element into an upper and a lower
triangular element and determining the constraint violation conditions of these
triangular elements. If there are any stresses which violate the given stress constraint
and allowances, we add an appropriate penalty to the weight, which proportionately
reduces the fitness returned to the GA for that particular table of local rules (GA
chromosome). Each set of local rules is applied to 100 different ICs of the CA, and
we use the average fitness over all these 100 ICs as the fitness that is returned to the GA
to evaluate the goodness or utility of each set of local rules generated by each
GA iteration. Thus, we progressively modify the set of local rules, which helps us to
achieve the minimum weight design in the smallest number of CA state transformation
cycles. The overall system architecture is illustrated in Figure 5.
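
The evaluation loop just described (run the CA from many random ICs, penalize stress violations, average the fitness) can be sketched as follows. This is a minimal sketch under stated assumptions: the FEA program is replaced by a caller-supplied `max_stress` function, and the penalty and the weight-to-fitness mapping are hypothetical choices, since the text gives neither formula. Only the count of 100 ICs and the 25 cells come from the text.

```python
import random


def average_fitness(rule_table, run_ca, weight, max_stress, stress_limit,
                    n_ics=100, n_cells=25, penalty=10.0, seed=0):
    """Average fitness of one rule table over `n_ics` random initial configurations.

    run_ca(ic, rule_table) -> final design (list of cell states)
    weight(design)         -> structural weight of the design
    max_stress(design)     -> largest stress (stand-in for the FEA result)
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_ics):
        ic = [rng.randint(0, 1) for _ in range(n_cells)]
        design = run_ca(ic, rule_table)
        w = weight(design)
        # Assumed penalty: inflate the weight in proportion to the violation.
        excess = max(0.0, max_stress(design) - stress_limit)
        penalized = w * (1.0 + penalty * excess / stress_limit)
        # Assumed mapping: lighter (penalized) designs get higher fitness.
        total += 1.0 / (1.0 + penalized)
    return total / n_ics


# Hypothetical stand-ins so the sketch runs without a CA or an FEA code.
def identity_ca(ic, rule_table):
    return ic


def total_thickness(design):
    return float(sum(design))


f_feasible = average_fitness(None, identity_ca, total_thickness,
                             lambda design: 0.0, stress_limit=1.0)
f_violating = average_fitness(None, identity_ca, total_thickness,
                              lambda design: 2.0, stress_limit=1.0)
```

With the same seed both calls see the same 100 ICs, so the only difference between the two values is the penalty.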




Fig 5 : The system architecture for the evolving CA rules for structural optimization. 



Table A. The 32 CA rules that produced the structural plate design shown in Figure 10.

5.1 Optimization Results

Figures 6 to 10 show results of preliminary experimentation and computer simulation.
These results also confirm that a self-organizing approach can be used, under
limited computational resources, to find (or learn) the best set of local rules for
the optimization of plate-like structures with various distributed plate thicknesses. We
believe that the reader will comprehend figures of the optimization results better
than a description of the final rules, and so we present figures of the final
results obtained. Nevertheless, as an example, some final evolved rules are shown in
Table A. The 32 CA local rules shown in Table A, along with the 8 edge rules (Case
2; see Section 4), produce the final design presented in Figure 10. The top lines of
Table A show the neighborhoods of the cells and the bottom lines the output bits (see
Figure 1).

Figure 10.

Number of rules =40 
GA Population = 150 
Crossover rate = 40% 
Mutation rate = 1% 
Generation = 66 



6 Conclusions 

The discovery of rules that produce global optimization of structural plates shows
instances of GAs producing sophisticated emergent computation in decentralized,
distributed systems such as CAs. The rule discoveries made by a GA are encouraging
for the prospect of using GAs to automatically evolve computation for more complex
tasks and in more complex optimization systems. Moreover, evolving CAs with GAs
also gives us a tractable framework in which to study the mechanisms by which an
evolutionary process might create complex coordinated behavior in natural
decentralized distributed systems.

Abstract. A first-principles program designed to compute, among other
quantum-mechanical observables, the total energy of a given molecule,
is efficiently parallelized using MPI as the underlying communication
layer. The resulting program fully distributes CPU and memory among
the available processes, making it possible to perform large-scale
Monte-Carlo Simulated Annealing computations of very large molecules,
exceeding the limits usually attainable by similar programs.



1 Introduction 

At present, an enormous effort is being dedicated to the study and fabrication of
nano-structures and new materials, which calls for a framework to compute, from
first principles, and predict, whenever possible, properties associated with these
types of systems. Among such frameworks, Density Functional Theory (DFT)
constitutes one of the most promising. Indeed, the success of DFT in computing
the ground state of molecular and solid-state systems was recognized in
1998 with the award of the Nobel Prize in Chemistry to Walter Kohn and John
Pople. DFT provides a computational framework with which the properties of
molecules and solids can, in certain cases, be predicted within chemical accuracy
(≈ 1 kcal/mol). Therefore, it is natural to try to take advantage of the most
recent computational paradigms in order to break new ground in these areas
of research and development.

In this work we report the successful parallelization of an ab-initio DFT program,
which makes use of a Gaussian basis-set. This, as will become clear in the
following section, is just one of the possible ways one may write a DFT code.
It has, however, several advantages: it allows the computation of neutral and
charged molecules on an equal footing, it makes it possible to write the code
in a modular fashion (leading to an almost ideal load balance), and it
is tailor-made to further exploit the recent developments of the so-called order-N
techniques. As a result, the program enables us to carry out the structural
optimization of large molecules via a Monte-Carlo Simulated Annealing strategy.

Typically, the implementation of a molecular DFT code using localized Gaussian
basis-states scales as a power of $N_{at}$ that depends on the implementation, where
$N_{at}$ is the number of atoms of the molecule. Such a scaling constitutes one of the
major bottlenecks for the application of these programs to large (> 50 atoms)
molecules without resorting to dedicated supercomputers. The fact that the
present implementation is written in a modular fashion makes it simple and
efficient to distribute the load among the available pool of processes. All tasks
so distributed are performed locally in each process, and all data required to
perform such tasks is also made available locally. Furthermore, the distribution of
memory among the available processes is also done evenly, in a non-overlapping
manner. In this way we optimize the performance of the code both in
CPU time and in memory requirements, which allows us to extend the
range of applicability of this technique.

This paper is organized as follows: In Section 2 a brief summary of the
underlying theoretical methods and models, as applied to molecules, is presented,
in order to set the framework and illustrate the problems to overcome. In
Section 3 the numerical implementation and strategy of parallelization is discussed,
whereas in Section 4 the results of applying the present program to the
structural optimization of large molecules using Simulated Annealing are presented
and compared to other available results. Finally, the main conclusions and future
prospects are left to Section 5.

2 Molecular Simulations with DFT 

In the usual Born-Oppenheimer Approximation (BOA) the configuration of a
molecule is defined by the positions $R_i$ of all the $N_{at}$ atoms of the molecule and
by their respective atomic numbers (nuclear charges). The energy of the electronic
ground state of the molecule is a function $E_{GS}(R_1, \ldots, R_{N_{at}})$ of those nuclear
positions. One of the objectives of quantum chemistry is to be able to calculate
relevant parts of that function, as the determination of the full function is
exceedingly difficult for all except the simplest molecules. In practice one may
try to find the equilibrium configuration of the molecule, given by the minimum
of $E_{GS}$, or one may try to do a statistical sampling of the surface at a given
temperature $T$. That statistical sampling can be done by Molecular Dynamics
(MD) or by Monte-Carlo (MC) methods. By combining the statistical sampling
at a given $T$ with a simulation process in which one begins at a high $T$
and, after equilibrating the molecule, starts reducing $T$ in small steps, always
equilibrating the molecule before changing $T$, one realizes an efficient algorithm
for the global minimization of $E_{GS}$, the so-called Simulated Annealing Method
(SAM).
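
The annealing schedule described above can be illustrated with a short Metropolis sketch. It is not the authors' program: the energy is a toy function rather than a DFT ground-state energy, and the cooling factor, step size and number of moves per temperature are arbitrary assumptions chosen only to make the example run quickly.

```python
import math
import random


def simulated_annealing(energy, x0, t_start=1.0, t_end=1e-3, cooling=0.9,
                        moves_per_t=200, step=0.5, seed=0):
    """Minimize `energy` by Metropolis sampling while the temperature is lowered.

    At each temperature, `moves_per_t` single-coordinate trial moves are made
    (the "equilibration"); then the temperature is reduced by `cooling`.
    """
    rng = random.Random(seed)
    x = list(x0)
    e = energy(x)
    best_x, best_e = list(x), e
    t = t_start
    while t > t_end:
        for _ in range(moves_per_t):
            i = rng.randrange(len(x))        # move one coordinate at a time
            old = x[i]
            x[i] = old + rng.uniform(-step, step)
            e_new = energy(x)
            # Metropolis criterion: always accept downhill, sometimes uphill.
            if e_new <= e or rng.random() < math.exp(-(e_new - e) / t):
                e = e_new
                if e < best_e:
                    best_x, best_e = list(x), e
            else:
                x[i] = old                   # reject: restore the coordinate
        t *= cooling
    return best_x, best_e


def toy_energy(x):
    """Hypothetical stand-in for E_GS: a bowl with its minimum at (1, 1)."""
    return sum((xi - 1.0) ** 2 for xi in x)


best_x, best_e = simulated_annealing(toy_energy, [3.0, -2.0])
```

In the setting of the paper each call to `energy` would be a full self-consistent DFT calculation, which is why the cost of a single energy evaluation dominates the whole procedure.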

The calculation of $E_{GS}$ for a single configuration is a difficult task, as it
requires the solution of an interacting many-electron quantum problem. In
Kohn-Sham DFT this is accomplished by minimizing a functional of the independent
electron orbitals $\psi_i(r)$,

$$E_{GS}(R_1, \ldots, R_{N_{at}}) = \min_{\psi_i} E_{KS}(R_1, \ldots, R_{N_{at}}; \psi_1, \ldots, \psi_{N_{el}}) \qquad (1)$$

where $N_{el}$ is the number of electrons of the molecule, and the minimization is
done under the constraint that the orbitals remain orthonormal,

$$\int \psi_i(r)\, \psi_j(r)\, d^3r = \delta_{ij}. \qquad (2)$$

The Euler-Lagrange equation associated with the minimization of the Kohn-Sham
functional is similar to a one-particle Schrodinger equation (written here in atomic units),

$$-\frac{1}{2} \nabla^2 \psi_i(r) + v_{eff}(r; \psi_1, \ldots, \psi_{N_{el}})\, \psi_i(r) = \epsilon_i\, \psi_i(r), \qquad (3)$$

except for the non-linear dependence of the effective potential $v_{eff}$ on the
orbitals. As our objective here is to discuss the numerical implementation of our
algorithms, we will not discuss the explicit form of $v_{eff}$ and the many
approximations devised for its practical calculation, and just assume one can calculate
$v_{eff}$ given the electron wavefunctions $\psi_i(r)$. The reader can find the details on
how to calculate $v_{eff}$ in excellent reviews, e.g., refs. [1,2] and references therein.
If one expands the orbitals in a finite basis-set,

$$\psi_i(r) = \sum_{j=1}^{M} c_{ij}\, \phi_j(r) \qquad (4)$$

then our problem is reduced to the minimization of a function of the coefficients,

$$E_{GS}(R_1, \ldots, R_{N_{at}}) = \min_{c_{ij}} E_{KS}(R_1, \ldots, R_{N_{at}}; c_{11}, \ldots, c_{N_{el} M}) \qquad (5)$$

and the Euler-Lagrange equation becomes a matrix equation of the form

$$\sum_{j} c_{ij} \left[ H_{kj} - \epsilon_i S_{kj} \right] = 0 \qquad (6)$$

where the eigenvalues are obtained, as usual, by solving the secular equation

$$\det \left| H_{ij} - E\, S_{ij} \right| = 0. \qquad (7)$$

The choice of the basis-set is not unique [3]. One of the most popular basis-sets
uses Gaussian basis-functions

$$\phi_i(r) = N_i \exp\left(-\alpha_i (r - R_i)^2\right) Z_{l_i}^{m_i}(r - R_i) \qquad (8)$$

where the angular functions $Z_{l_i}^{m_i}$ are chosen to be real solid harmonics, and $N_i$ are
normalization factors. These functions are centered on a nucleus $R_i$ and are an
example of localized basis-sets. This is an important aspect of the method, since
it implies that each matrix-element $H_{ij}$ results from the contribution
of a large summation of three-dimensional integrals involving basis-functions
centered at different points in space. This multicenter topology involved in the
computation of $H_{ij}$ ultimately determines the scaling of the program as a function
of $N_{at}$. Finally, one should note that, for the computation of $H_{ij}$, one needs
to know $v_{eff}$, which in turn requires knowledge of $\psi_i(r)$. As usual, the solution is
obtained via a self-consistent iterative scheme, as illustrated in Fig. 1.




Fig. 1. Self-consistent iterative scheme for solving the Kohn-Sham equations.
One starts from an educated guess for the initial density which, in DFT, can be
written in terms of the eigenfunctions of the Kohn-Sham equations as
$\rho(r) = \sum_i n_i |\psi_i(r)|^2$. After several iterations one arrives at a density which does not
change any more upon iteration.
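
The structure of the self-consistent cycle of Fig. 1 is that of a fixed-point iteration: a density goes in, the equations are solved, a new density comes out, and the loop stops when the two agree. The sketch below shows only that control flow on a scalar toy problem. The `update` function stands in for "build the potential and solve the Kohn-Sham equations", and the linear mixing of old and new densities is an assumption introduced for the example, not something stated in the text.

```python
import math


def self_consistent(update, rho0, mixing=0.5, tol=1e-10, max_iter=500):
    """Iterate rho -> update(rho) until input and output densities agree.

    Returns the converged density and the number of iterations used.
    """
    rho = rho0
    for iteration in range(1, max_iter + 1):
        rho_out = update(rho)
        if abs(rho_out - rho) < tol:
            return rho_out, iteration
        # Assumed damping: mix old and new densities to stabilize the cycle.
        rho = (1.0 - mixing) * rho + mixing * rho_out
    raise RuntimeError("self-consistency not reached")


# Toy stand-in for the Kohn-Sham step: the fixed point of cos(rho).
rho_sc, n_iter = self_consistent(math.cos, rho0=0.0)
```

A better starting density shortens the loop, which is the role played by the "educated guess" in Fig. 1.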



Due to the computational costs of calculating $E_{GS}$ from first principles, for a
long time the statistical sampling of $E_{GS}$ was restricted to empirical or
simplified representations of that function. In a seminal paper, Car and Parrinello [4]
(CP) proposed a method that was so efficient that one could for the first time
perform first-principles molecular dynamics simulations. Their key idea was to
use molecular dynamics not only to sample the atomic positions but also to
minimize in practice the Kohn-Sham functional. Furthermore, they used an
efficient manipulation of the wave-functions in a plane-wave basis-set to speed up
their calculations. Although nothing in the CP method is specific to a given
type of basis-set, the overwhelming majority of CP simulations
use a plane-wave basis-set, to the point that most people would automatically
assume that a CP simulation uses a plane-wave basis-set.

Although one can use plane-waves to calculate molecular properties with a 
super-cell method, most quantum chemists prefer the use of gaussian basis-sets. 
What we present here is an efficient parallel implementation of a method where 
the statistical sampling of the atomic positions is done with MC and the Kohn- 
Sham functional is directly minimized in a gaussian basis-set. 



3 Numerical Implementation 

3.1 Construction of the Matrix 



Each matrix-element $H_{ij}$ has many terms, which are usually classified by the
number of different centers involved in its computation. The time and memory
consuming terms are those associated with three-center integrals used for the
calculation of the effective potential $v_{eff}$. For the sake of simplicity we will assume
that the effective potential is described also as a linear combination of functions
$g_k(r)$,

$$v_{eff}(r; \{\psi_i\}) = \sum_{k=1}^{L} f_k(\{c_{ij}\})\, g_k(r), \qquad (9)$$

where the coefficients $f_k$ have a dependence on the wavefunction coefficients,
and $g_k$ are atom-centered gaussian functions. Actually, in the program only the
exchange and correlation term of the effective potential is expanded this way,
but the strategy of parallelization for all other contributions is exactly the same,
and so we will not describe in detail the other terms.

The contribution of the effective potential to the hamiltonian $H_{ij}$ is

$$V_{ij} = \int \phi_i(r)\, v_{eff}(r; \{\psi_i\})\, \phi_j(r)\, d^3r = \sum_{k=1}^{L} f_k(\{c_{ij}\}) \int \phi_i(r)\, g_k(r)\, \phi_j(r)\, d^3r = \sum_{k=1}^{L} f_k(\{c_{ij}\})\, A_{ikj} \qquad (10)$$

where the integral $A_{ikj} = \int \phi_i(r)\, g_k(r)\, \phi_j(r)\, d^3r$ involves three gaussian
functions, and can be calculated analytically. Furthermore, all dependence on
wavefunction coefficients is now in the coefficients $f_k$ of the potential, and the integrals
$A_{ikj}$ are the same in all the self-consistent iterations. This means that the whole
iterative procedure illustrated in Fig. 1 now amounts to repeatedly recombining the
same integrals, but with different coefficients at different iterations throughout
the self-consistent procedure.

We can now appreciate the two computational bottlenecks of a gaussian
program. As the indices $i$, $j$ and $k$ can each reach several hundred, the
three-index array $A_{ikj}$ requires a huge amount of memory. Although analytical,
the calculation of each $A_{ikj}$ is non-trivial and requires a considerable number
of floating point operations. In addition, the summation in eq. 10 has to be repeated for each
of the self-consistent iterations.

So far, no parallelization has been attempted. We now take advantage of the
modular structure of the program in order to distribute tasks among the available
processes in an even and non-overlapping way. In keeping with this discussion,
we recast each matrix-element $V_{ij}$ in the form

$$V_{ij} = \sum_{\lambda=1}^{N_{proc}} V_{ij}[\lambda] \qquad (11)$$

where the contributions $V_{ij}[\lambda]$ are evenly distributed among the $N_{proc}$ processes
executing the program; that is, for a given pair $(i, j)$, $V_{ij}[\lambda]$ is null except in one
of the processes. Similarly, the three-index array $A_{ikj}$ is distributed as

$$A_{ikj} = \sum_{\lambda=1}^{N_{proc}} A_{ikj}[\lambda] \qquad (12)$$

in such a way that $A_{ikj}[\lambda]$ is null if $V_{ij}[\lambda]$ is null. Of course, the null elements
are not stored, so on a distributed memory machine the storage of the large array
$A_{ikj}$ is spread over all the processes. As

$$V_{ij}[\lambda] = \sum_{k=1}^{L} f_k(\{c_{ij}\})\, A_{ikj}[\lambda] \qquad (13)$$

there is no need to exchange the values of $A_{ikj}$ among processes, but only those
of $f_k$ before the summation, and $V_{ij}[\lambda]$ after the summation. So the calculation of
$A_{ikj}$ is distributed among the processes, the storage is also distributed, and $A_{ikj}$
never appears in the communications.
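
The decomposition of eqs. 11 to 13 can be emulated in plain Python, with a loop over "ranks" standing in for the MPI processes. This is a sketch under assumptions, not the authors' code: the round-robin `owner` map, the function names, and the made-up `toy_integral` values are all hypothetical. What it illustrates is that each process stores only the $A_{ikj}$ of the pairs $(i, j)$ it owns, needs only the coefficients $f_k$ as input, and returns only its part of $V_{ij}$.

```python
def owner(i, j, n_basis, n_proc):
    """Hypothetical static map from a pair (i, j) to the process that owns it."""
    return (i * n_basis + j) % n_proc


def toy_integral(i, k, j):
    """Stand-in for the analytic three-gaussian integral A_ikj (made-up values)."""
    return 1.0 / (1 + i + 2 * k + j)


def build_local_integrals(rank, n_proc, n_basis, n_fit):
    """Compute and store A_ikj only for the pairs owned by `rank` (eq. 12)."""
    return {(i, j): [toy_integral(i, k, j) for k in range(n_fit)]
            for i in range(n_basis) for j in range(n_basis)
            if owner(i, j, n_basis, n_proc) == rank}


def local_potential(a_local, f):
    """V_ij[rank] = sum_k f_k A_ikj[rank] for the locally stored pairs (eq. 13)."""
    return {pair: sum(fk * a for fk, a in zip(f, a_k)) for pair, a_k in a_local.items()}


n_basis, n_fit, n_proc = 4, 3, 3
f = [0.5, -1.0, 2.0]   # coefficients f_k, the only data every process needs

# Done once: each "process" builds its own, non-overlapping share of A_ikj.
stores = [build_local_integrals(rank, n_proc, n_basis, n_fit) for rank in range(n_proc)]

# Done at every self-consistent iteration: local sums, then a global sum (eq. 11).
v = [[0.0] * n_basis for _ in range(n_basis)]
for store in stores:
    for (i, j), value in local_potential(store, f).items():
        v[i][j] += value
```

In an MPI program the final loop would be replaced by a collective reduction of the local contributions, and the broadcast of `f` would be the only other communication in each iteration.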

Finally, and due to the iterative nature of the self-consistent method, the 
code decides - a priori - which process will be responsible for the computation 
of a given contribution to Vij [A] . This allocation is kept unchanged throughout 
an entire self-consistent procedure. 
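
The distribution scheme of eqs. 11 to 13 can be illustrated with a minimal serial sketch in Python. It is not the authors' FORTRAN 77/MPI code: the processes are simulated by a loop, the round-robin rule in `owner` is a hypothetical a-priori allocation, and `toy_integral` and the coefficients `f` are placeholder values chosen only for illustration.

```python
# Toy model of eqs. 11-13: each simulated "process" stores only its own
# slice of A_ikj and returns its partial matrix V_ij[lambda].

def owner(i, j, n_basis, n_proc):
    """Hypothetical a-priori allocation: pair (i, j) -> process rank."""
    return (i * n_basis + j) % n_proc

def build_local_A(rank, n_basis, n_fit, n_proc, integral):
    """Store A_ikj only for the (i, j) pairs owned by this rank."""
    return {
        (i, k, j): integral(i, k, j)
        for i in range(n_basis)
        for j in range(n_basis)
        if owner(i, j, n_basis, n_proc) == rank
        for k in range(n_fit)
    }

def partial_V(local_A, f):
    """Eq. 13: V_ij[lambda] = sum_k f_k * A_ikj[lambda]; null entries are absent."""
    V = {}
    for (i, k, j), a in local_A.items():
        V[(i, j)] = V.get((i, j), 0.0) + f[k] * a
    return V

def total_V(n_basis, n_fit, n_proc, integral, f):
    """Eq. 11: sum the partial matrices, as a reduction step would."""
    V = {}
    for rank in range(n_proc):
        local_A = build_local_A(rank, n_basis, n_fit, n_proc, integral)
        for key, value in partial_V(local_A, f).items():
            V[key] = V.get(key, 0.0) + value
    return V

def toy_integral(i, k, j):
    # Placeholder values; the real integrals are analytical gaussian integrals.
    return 1.0 / (1 + i + 2 * k + 3 * j)

f = [0.5, -1.0, 2.0]                      # fit coefficients f_k (toy values)
V_serial = total_V(4, 3, 1, toy_integral, f)
V_parallel = total_V(4, 3, 3, toy_integral, f)
```

In this sketch only the coefficients `f` and the partial matrices cross process boundaries, while the slices of Aikj stay where they were computed, which is the point made in the text.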



3.2 Eigenvalue Problem 

For Nat atoms, and assuming that we take a basis-set of M gaussian functions 
per atom, our eigenvalue problem, eqs. 6 and 7, will involve a matrix of dimension 
(Nat × M). Typical numbers for an atomic cluster made out of 20 sodium atoms 
would be Nat = 20 and M = 7, that is, a matrix of dimension 140. This is a pretty 
small dimension for a matrix to be diagonalized, so the CPU effort is not 
associated with the eigenvalue problem but, mostly, with the construction of the 
matrix-elements Hij. We have not yet parallelized the diagonalization. Its 
parallelization poses no conceptual difficulty, since this problem is tailor-made 
to be dealt with by existing parallel packages, such as ScaLAPACK. As this part 
of the code is the most CPU time consuming among the non-parallelized parts of 
the code, it is our next target for parallelization. 



3.3 Monte-Carlo Iterations 

Once E_GS(R_1, ..., R_Nat) has been obtained for a given molecular configuration, the 
Monte-Carlo Simulated Annealing algorithm “decides” upon the next move. As 
stated before, this procedure will be repeated many thousands of times before an 
annealed structure is obtained, hopefully corresponding to the global minimum 
of E_GS. 

When moving from one MC iteration to the next, Simulated Annealing 
algorithms typically change the coordinates of one single atom, R_α → R_α + δR. 
As the basis set is localized, each of the indices in Aijk is associated with a given 
atom. If none of the indices is associated with the atom α, then Aijk does not 
change, and therefore is not recalculated. In this way, only a fraction of the order 
of 1/Nat of the total number of integrals Aijk needs to be recalculated, leading 
to a substantial saving in computer time, in particular for the larger systems. 
Furthermore, the “educated guess” illustrated in fig. 1, used to start the 
self-consistent cycle, is taken, for MC iteration n + 1, as the self-consistent density 
obtained from iteration n. In this way, in all but the start-up MC iteration, the 
number of iterations required to attain self-consistency becomes small. It is this 
coupling between the Monte-Carlo and DFT parts of the code that allows us to 
have a highly efficient code, which enables us to run simulations in which the 
self-consistent energy of a large cluster needs to be computed many thousands 
of times (see below). 
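
As a rough check of the "order of 1/Nat" estimate, the following Python sketch counts the integrals that have at least one index on the moved atom. It rests on simplifying assumptions that are not taken from the paper: every atom carries the same number of functions, and all three indices run over the same set of functions. Under these assumptions the fraction is 1 - (1 - 1/Nat)^3, which approaches 3/Nat for large clusters.

```python
from itertools import product

def fraction_to_recompute(n_atoms, funcs_per_atom, moved_atom):
    """Fraction of A_ijk with at least one index centred on the moved atom."""
    # Map every function index to the atom it is centred on.
    atom_of = [a for a in range(n_atoms) for _ in range(funcs_per_atom)]
    n = len(atom_of)
    touched = sum(
        1
        for i, j, k in product(range(n), repeat=3)
        if moved_atom in (atom_of[i], atom_of[j], atom_of[k])
    )
    return touched / n**3

# Toy sizes, chosen so that the brute-force count stays small.
frac = fraction_to_recompute(n_atoms=8, funcs_per_atom=2, moved_atom=3)
exact = 1 - (1 - 1 / 8) ** 3
```

All the other integrals can be reused unchanged from the previous MC iteration.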



4 Results and Discussion 

The program has been written in FORTRAN 77 and we use MPI as the underly- 
ing communication layer, although a PVM translation would pose no conceptual 
problems. Details of the DFT part of the program in its non-parallel version have 
been described previously [6]. The MC method and the SAM algorithm are 
well described in many excellent textbooks [7]. 

The hardware architecture on which all results presented here were obtained 
is a farm of 22 DEC 500/500 workstations. The nodes are 
connected via a fast-ethernet switch, in such a way that all nodes reside in the 
same virtual (and private) fast-ethernet network. As for software, the 
22 workstations run Digital Unix version 4.0-d and the DEC Fortran compiler 
together with the DXML libraries, and the communication layer is provided by 
the free MPICH [8] distribution, version 1.1. Nevertheless, we would like to point 
out that the same program has been tested successfully on a PC, a dual Pentium 
II-300, running Linux-SMP, g77 Fortran and LAM-MPI [9] version 6.2b. 

We started to test the code by choosing a non-trivial molecule for which 
results exist, obtained with other programs and using algorithms different from 
the SAM. We therefore considered an atomic cluster made out of eight sodium 
atoms, Na8. Previous DFT calculations indicate that a D2d structure (left 
panel of fig. 2) corresponds to the global minimum of E_GS [Q]. 
Making use of our program, we have reproduced this result without difficulty. 
Indeed, we performed several SAM runs starting from different choices for the 
initial structure, and the minimum value obtained for E_GS corresponded 
to the D2d structure. One should note that one SAM run for Na8 involves the 
determination of E_GS up to 2.2 × 10^[…] times. Typically, we have used 1000 
MC-iterations at a given fixed temperature T in a single SAM run. This number, 
which is reasonable for the smaller clusters, becomes too small for the larger ones 
whenever one wants to carefully sample the phase-space associated with the 
{R_1, ..., R_Nat} coordinates. 

As shown in the right panel of fig. 2, Na9+ was our second choice. This is a 
nine-atom sodium cluster from which one electron has been removed. As is well known [5], 
this cluster, together with Na8, is a so-called magic cluster, in the sense 
that both display an abnormally large stability as compared to their neighbours 
in size [10]. When compared with quantum-chemistry results, the DFT structures 
are different, both for Na8 and Na9+. This is not surprising, since the underlying 
theoretical methods and the minimization strategies utilized are also different, and 
the hyper-surface corresponding to E_GS({R_i}) is very shallow 
in the neighbourhood of the minima, irrespective of the method. Nevertheless, 
recent experimental evidence seems to support the DFT results [10]. 




Fig. 2. Global minimum of E_GS for the two magic sodium clusters Na8 and 
Na9+. For the determination of such global minima a SAM algorithm has been 
employed, requiring many thousands of first-principles computations of E_GS to 
be carried out. 

In order to test the performance of the parallelization, we chose Na9+ and 
carried out two different kinds of benchmarks. First we executed the program 
performing 1 iteration (the start-up iteration) for Na9+ and measured the CPU 
time T_CPU as a function of the number of processes N_proc. For the basis-set 
used, the number of computed Aikj elements is, in this case, 328779. As can be 
seen from eq. 13, the ratio of computation to communication is proportional to 
the number of fit functions L. By choosing a small molecule, where L is small, 
we are showing an unfavorable case, where the parallelization gains are small, 
so that we can discuss the limits of our method. In fig. 3 we plot, with a solid line, 
the inverse of the CPU time as a function of N_proc. 

Our second benchmark calculation involves the computation of 100 
MC-iterations. For direct comparison on the same scale, we multiplied the inverse 
of T_CPU by the number of iterations. The resulting curve is drawn with a dashed 
line in fig. 3. 




Fig. 3. Dependence of inverse CPU time (multiplied by the number of MC- 
iterations) as a function of the number of processes (in our case, also dedicated 
processors) for two benchmark calculations (see main text for details). A direct 
comparison of the curves illustrates what has been parallelized in the code and 
where the parallelization plays its major role. 



Several features can be inferred from a direct comparison of the 2 curves. First 
of all, there is an ideal number N_proc of processes over which the run should be distributed. 
Indeed, fig. 3 shows that efficiency may actually drop as N_proc is increased. For 
this particular system, N_proc = 8 is the ideal number. This “node-saturation”, 
which takes place here for Na9+, is related to the fact that the time per iteration is 
small enough for one to be able to observe the communication overhead due 
to the large number of nodes over which the run is distributed. When the number 
of atoms increases, this overhead becomes comparatively smaller and ceases to 
produce such a visible impact on the overall benchmarks. From fig. 3 one can 
also observe that, for small N_proc, the largest gain of efficiency is obtained for 
the 1-iteration curve. This is so because that is where the parallelization plays its 
biggest role. Indeed, as stated in section 3, the number of floating point operations 
which are actually performed in the subsequent MC-iterations is considerably 
reduced, compared to those carried out during the start-up iteration. As a result, 
the relative gain of efficiency as N_proc increases becomes smaller in this case. 
However, since both CPU and memory are distributed, it may prove convenient 
to distribute a given run, even if the gain is not overwhelming. 

The solid curve of fig. 3 is well fitted by the function 0.25 − 0.17/N_proc up to 
N_proc = 8, which reveals that a good level of parallelization has been obtained. 
This is particularly true if we consider that the sequential code has 14200 lines 
and is very complex, combining many different numerical algorithms. 
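
For orientation, the fitted function can be turned into speedup and efficiency figures. The short Python sketch below only re-evaluates the quoted fit, whose units are those of fig. 3 and which is stated to hold only up to N_proc = 8; the function names are illustrative and the figures it prints are properties of the fit, not additional measurements.

```python
def inverse_cpu_time(n_proc):
    """Fitted inverse CPU time, 0.25 - 0.17/N_proc (units as in fig. 3)."""
    return 0.25 - 0.17 / n_proc

def speedup(n_proc):
    """T(1)/T(N_proc), computed from the fitted inverse times."""
    return inverse_cpu_time(n_proc) / inverse_cpu_time(1)

# Speedup and efficiency (speedup divided by N_proc) implied by the fit.
for n in (1, 2, 4, 8):
    print(n, round(speedup(n), 2), round(speedup(n) / n, 2))
```

Because the fitted inverse time approaches a constant for large N_proc, the speedup implied by the fit saturates, which is consistent with the node-saturation discussed above for this small cluster.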

Finally, we would like to remark that, at present, memory requirements seem 
to put the strongest restrictions on the use of the code. This is so because of 
the peculiar behaviour of MPICH, which creates, for each original process, a 
“clone-listener” that requires the same amount of memory as 
the original process. This is unfortunate since it forces us, for big molecules, to 
set up a very large amount of swap space on disk in order to enable MPI to 
operate successfully. In our opinion, this is a clear limitation. We are, at present, 
working on alternative ways to overcome such problems. 

In fig. 4 we show our most recent results in the search for global minima of 
sodium clusters. The structures displayed in fig. 4 have 21 (left panel) and 
41 (right panel) sodium atoms. A total of 4147605 matrix-elements is required 
to compute each iteration of the self-consistent procedure for Na21+, whereas for 
Na41+ the corresponding number is 30787515. The structures shown in fig. 4 
illustrate the possibilities of the code, which are, at present, limited exclusively 
by swap space. Of course, the CPU time for these simulations is much 
bigger than for the smaller clusters discussed previously. In this sense, the structure 
shown for Na41+ cannot be considered unambiguously converged, since 
more SAM runs need to be executed. On the other hand, we believe the 
structure depicted for Na21+ to be fully converged. Since no direct experimental 
data for these structures exist, only indirect evidence can support or rule 
out such structural optimizations. The available experimental data [10] indirectly 
support this structure since, from the experimental location of the main peaks of 
the photo-absorption spectrum of such a cluster, one may infer the principal-axes 
ratio of the cluster, in agreement with the prediction of fig. 4. 

Fig. 4. Global minima for two large singly ionized sodium clusters with 21 atoms 
(left panel) and 41 atoms (right panel). Whereas the structure of Na21+ can be 
considered as “converged”, the same cannot be unambiguously stated for the 
structure shown for Na41+. For this largest cluster, the structure displayed shows 
our best result so far, although further SAM runs need to be carried out. 



5 Conclusions and Future Applications 

In summary, we have succeeded in parallelizing a DFT code which efficiently 
computes the total energy of a large molecule. We have managed to parallelize the 
most time and memory consuming parts of the program, except, as mentioned 
in section 3.2, the diagonalization block, which remains to be done. This is 
good enough for a small farm of workstations, but not for a massively parallel 
computer. We should point out that it is almost trivial to parallelize the 
Monte-Carlo algorithm. In fact, as a SAM is repeated starting from different initial 
configurations, one just has to run several jobs simultaneously, each on its own group 
of processors. However, this will not have the advantage of distributing the 
large matrix Aikj. As storage is critical for larger molecules, parallelizing the 
DFT part of the code may be advantageous even when the gains in CPU time 
do not look promising. 

The code is best suited for use in combination with MC-type simulations, 
since we have shown that, under such circumstances, not only do the results of a 
given iteration provide an excellent starting point for the following iteration, 
but also much of the computation necessary to obtain the total energy at 
a given iteration has already been carried out in the previous iteration. 
Preliminary results illustrate the feasibility of running first-principles, 
large-scale SAM simulations of big molecules, without resorting to dedicated 
supercomputers. Work along these lines is under way. 



Abstract. A powerful technique used for power system composite 
reliability evaluation is Monte Carlo Simulation (MCS). There are two 
approaches to MCS in this context: non-sequential MCS, in which the 
system states are randomly sampled, and sequential MCS, in which the 
chronological behaviour of the system is simulated by sampling sequences 
of system states for several time periods. The sequential MCS approach 
can provide information that the non-sequential cannot, but requires 
higher computational effort and is more sequentially constrained. This 
paper presents a parallel methodology for composite reliability evaluation 
using sequential MCS on three different computer platforms: a scalable 
distributed memory parallel computer IBM RS/6000 SP with 10 processors, 
a network of workstations (NOW) composed of 8 IBM RS/6000 43P 
workstations and a cluster of PCs composed of 8 Pentium III 500MHz 
personal microcomputers. The results obtained in tests with actual power 
system models show considerable reduction of the simulation time, with 
high speedup and good efficiency. 



1 Introduction 

The primary function of electric power systems is to satisfy the consumers’ de- 
mand in the most economic way and with an acceptable degree of continuity, 
quality and security. The ideal situation would be that the energy supply was 
uninterrupted. However, the occurrence of failures of some components of the 
system can produce disturbances capable of leading to the interruption of the 
electric energy supply. In order to reduce the probability, frequency and duration 
of these failure events and their effects, it is necessary to make financial 
investments that increase the reliability of the system. It is evident that 
the economic and the reliability requirements can conflict and make it difficult 
to take the right decisions. 

The new competitive environment of the electric energy market makes the 
evaluation of energy supply reliability of fundamental importance when closing 
contracts between utility companies and heavy consumers. In this context, the 
definition of the costs associated with supply interruption deserves special 
attention, since engineers must now evaluate how much it is worth investing 
in the system reliability, as a function of the cost of the investment itself and the 
cost of the interruption for the consumer and for the energy vendors. This new 
environment also requires reliability evaluation of larger parts of the 
interconnected system and it can demand, in some cases, nation-wide systems modelling. 
For this purpose, it becomes necessary to develop computational tools capable 
of modelling and analysing power systems of very high dimensions. 

J.M.L.M. Palma et al. (Eds.): VECPAR2000, LNCS 1981, pp. 242-253, 2001. 
Springer-Verlag Berlin Heidelberg 2001 

One of the most important methods used for reliability evaluation of power 
systems composed of generation and transmission sub-systems is Monte Carlo 
Simulation (MCS). MCS allows accurate modelling of the power system com- 
ponents and operating conditions, provides the probability distributions of vari- 
ables of interest, and is able to handle complex phenomena and a large number 
of severe events [1,2]. 

There are two different approaches for Monte Carlo simulation when used 
for composite system reliability evaluation: Non-Sequential MCS and Sequential 
MCS. In Non-Sequential MCS, the state sampling approach is used, in which 
case the state space is randomly sampled without reference to the system op- 
eration process chronology. This implies disregarding the transitions between 
system states. In Sequential MCS, the chronological representation is adopted, 
in which case the system states are sequentially sampled for several periods, usu- 
ally years, simulating a realisation of the stochastic process of system operation. 
The expected values of the main reliability indices can be calculated by both ap- 
proaches. However, estimates of specific energy supply interruption duration and 
the probability distribution of duration related indices can only be obtained by 
sequential MCS [3]. In applications related to production cost evaluation, only 
the sequential approach can then be used. On the other hand, sequential MCS 
demands higher computational effort than non-sequential MCS. Depending on 
the system size and modelling level, the sequential MCS computer requirements 
on conventional computer platforms may become unacceptable [4]. 

In both MCS approaches, the reliability evaluation demands adequacy anal- 
ysis of a very large number of system operating states, with different topological 
configurations and load levels. Each one of these analyses simulates the operation 
of the system at that particular sampled state, in order to determine if the en- 
ergy demand can be met without operating restrictions and security violations. 
The main difference between the two approaches lies in the way the 
system states are sampled: randomly in non-sequential MCS and 
sequentially over time in sequential MCS. Each new sampled state in 
sequential MCS depends on the configuration and duration of the previously 
sampled one. 

This paper describes results obtained by a parallel methodology for composite 
reliability evaluation using sequential MCS on three different multicomputer 
platforms: a scalable distributed memory parallel computer IBM RS/6000 
SP with 10 processors, a network of workstations (NOW) composed of 8 IBM 
RS/6000 43P workstations and a cluster of PCs composed of 8 Pentium III 
500MHz personal microcomputers. In a previous paper [5], a methodology for 
parallelisation of composite reliability evaluation using non-sequential MCS was 
presented. Tests performed on three electric systems showed almost 
linear speedup and very good efficiency on a 4-node IBM RS/6000 SP 
parallel computer. As a continuation of that work, this paper now deals with 
sequential MCS on three different multicomputer platforms. The chronological 
dependency between consecutive sampled states that exists in the sequential 
MCS approach makes the development of a parallel algorithm 
much more complex than it was for non-sequential MCS. Although 
MCS techniques are used to sample the system state configurations, each state 
is now coupled in time with the next one, and this introduces significant 
sequential constraints in the parallelisation process. 

The methodology presented in this paper is based on coarse grain asyn- 
chronous parallelism, where the adequacy analysis of the system operating states 
within each simulated year is performed in parallel on different processors and 
the convergence is checked on one processor at the end of each simulated year. 
Some actual power system models are used for evaluating the performance of 
the methodology together with the scalability correlation with the network ar- 
chitecture and bandwidth. 

2 Sequential Monte Carlo Simulation 

The power system reliability evaluation consists of the calculation of several in- 
dices, which are indicators of the system adequacy to the energy demand, taking 
into consideration the possibility of occurrence of failures of the components. In 
particular, the composite reliability evaluation considers the possibility of failures 
at both the generation and the transmission sub-systems. A powerful technique 
used for power system composite reliability evaluation is MCS [1]. One possible 
approach is the chronological representation of the system operation stochastic 
process, in which case the system states are sequentially sampled in time. One 
implementation of the chronological representation is the use of sequential MCS. 
In sequential MCS, the system operation is simulated by sampling sequences of 
operating states based on the probability distribution of the components’ states 
duration. These sequences are sampled for several periods, usually years, and are 
called yearly synthetic sequences. The duration of the states and the transitions 
between consecutive system states are represented in these synthetic sequences. 

The reliability indices calculation using sequential MCS may be represented 
by the evaluation of the following expression: 

$$E(F) = \frac{1}{N} \sum_{k=1}^{N} F(y_k) \qquad (1)$$

where 

N: number of simulated years 

y_k: yearly synthetic sequence composed of the sampled system states within 
year k 

F: adequacy evaluation function to calculate yearly reliability indices over the 
sequence y_k 

E(F): estimate of the expected value of the adequacy evaluation function 



The reliability indices correspond to estimates of the expected values of dif- 
ferent adequacy evaluation functions F for a sample composed of N simulated 
years. For calculation of the values of F associated with the various indices, 
it is necessary to simulate the operating condition of all system states within 
a year. Each simulation requires the solution of a static contingency analysis 
problem and, in some cases, the application of a remedial actions scheme to 
determine the generation re-scheduling and the minimum load shedding. Most 
of the computational effort demanded by the algorithm is concentrated in this 
step. 

The convergence of the evaluation process is controlled by the accuracy of the 
MCS estimates, measured by the coefficient of variation α, which quantifies the 
uncertainty around the estimates and is defined as: 

$$\alpha = \frac{\sqrt{V(E(F))}}{E(F)} \qquad (2)$$

where V(E(F)) is the variance of the estimator. 

A conceptual algorithm for composite reliability evaluation using sequential 
MCS is described next: 



1. Generate a yearly synthetic sequence of system states y_k; 

2. Chronologically evaluate the adequacy of all system states within the sequence 
y_k and accumulate these results; 

3. Calculate the yearly reliability indices F(y_k) based on the values calculated 
in step (2); 

4. Update the expected values of the process reliability indices E(F) based on 
the indices calculated in step (3); 

5. If the accuracy of the estimates of the process indices is acceptable, terminate 
the process. Otherwise, return to step (1). 
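
The five steps above can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation: `toy_yearly_index` is a hypothetical stand-in for steps 1 to 3 (it draws a random number instead of simulating a power system), and the variance of the estimator is taken as V(F)/N, which assumes that the simulated years are independent.

```python
import math
import random

def sequential_mcs(yearly_index, tolerance=0.05, min_years=10,
                   max_years=100000, seed=0):
    """Repeat steps 1-5 until the coefficient of variation drops below tolerance."""
    rng = random.Random(seed)
    total = 0.0       # running sum of F(y_k)
    total_sq = 0.0    # running sum of F(y_k)**2
    for n in range(1, max_years + 1):
        f = yearly_index(rng)              # steps 1-3: one simulated year
        total += f
        total_sq += f * f
        mean = total / n                   # step 4: E(F), eq. 1
        if n >= min_years and mean > 0.0:
            # Sample variance of F over the n simulated years.
            var_f = (total_sq - n * mean * mean) / (n - 1)
            # Eq. 2, assuming V(E(F)) = V(F)/N for independent years.
            alpha = math.sqrt(max(var_f, 0.0) / n) / mean
            if alpha <= tolerance:         # step 5: convergence check
                return mean, alpha, n
    raise RuntimeError("no convergence within max_years")

def toy_yearly_index(rng):
    """Hypothetical yearly index; a placeholder, not a power-system model."""
    return rng.expovariate(1.0)

estimate, alpha, years = sequential_mcs(toy_yearly_index)
```

The `min_years` guard is a design choice of this sketch: it avoids stopping on the very noisy variance estimates obtained from the first few simulated years.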

The yearly synthetic sequence is generated by combining the components’ 
states transition processes and the chronological load model variation in the same 
time basis. The component states transition process is obtained by sequentially 
sampling the probability distribution of the component states duration, which 
may follow an exponential or any other distribution. This technique is called 
the State Duration Sampling Approach. 



2.1 State Duration Sampling Approach 

This sampling process is based on the probability distribution of the component 
states duration. The chronological component state transition processes for 
all components are first simulated by sampling. The chronological system state 
transition process is then created by combining the chronological component 
state transition processes [1]. This approach is detailed in the algorithm below for 
the case of an exponential probability distribution of the component state duration: 

1. Specify the initial state of the system by the combination of the initial states 
of all components; 

2. Sample the duration of each component residing in its present state (t_i) by: 

$$t_i = -\frac{1}{\lambda_i} \ln U_i \qquad (3)$$

where λ_i is the transition rate of the component and U_i is a uniformly 
distributed random number in [0,1]. 

3. Repeat step (2) for the whole period (year) and record the sampled values of 
each state duration for all components. 

4. Create the chronological system state transition process by combining the 
chronological component state transition processes obtained in step (3) for 
all components. This combination is done by considering that a new system 
state is reached when at least one component changes its state. 

This approach is illustrated in Fig. 1 for a system composed of two compo- 
nents represented by a two-state stochastic model. 
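
A minimal Python sketch of the state duration sampling steps for two-state components is given below. It is an illustration under stated assumptions, not the authors' code: all components are assumed to start in the up state, the failure and repair rates are hypothetical values in occurrences per hour, and the year is taken as 8760 hours.

```python
import math
import random

HOURS_PER_YEAR = 8760.0

def component_chronology(fail_rate, repair_rate, rng, horizon=HOURS_PER_YEAR):
    """Steps 2-3: list of (start_time, is_up) for one two-state component."""
    t, is_up, events = 0.0, True, []
    while t < horizon:
        events.append((t, is_up))
        rate = fail_rate if is_up else repair_rate
        u = 1.0 - rng.random()            # uniform in (0, 1], avoids log(0)
        t += -math.log(u) / rate          # eq. 3
        is_up = not is_up
    return events

def system_chronology(chronologies, horizon=HOURS_PER_YEAR):
    """Step 4: a new system state starts whenever any component changes state."""
    change_times = sorted({t for events in chronologies for t, _ in events})
    states = []
    for idx, start in enumerate(change_times):
        end = change_times[idx + 1] if idx + 1 < len(change_times) else horizon
        # State of each component at `start`: its last event not later than start.
        state = tuple(
            [up for t, up in events if t <= start][-1] for events in chronologies
        )
        states.append((start, end - start, state))
    return states

rng = random.Random(1)
# Hypothetical rates, chosen only for illustration.
components = [component_chronology(0.002, 0.05, rng) for _ in range(2)]
system = system_chronology(components)
```

Each entry of `system` holds the start time, the duration and the tuple of component states of one system state, so the list plays the role of a yearly synthetic sequence.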



[Figure: up/down state of Component 1 and of Component 2 versus time, and the resulting system states (1 up 2 up, 1 down 2 up, 1 up 2 down, 1 down 2 down) versus time.] 

Fig. 1. State Duration Sampling 



3 Parallel Methodology 

One possible approach to parallelise the problem described above is to have a 
complete year analysed on a single processor and the many years necessary to 
converge the process analysed in parallel on different processors. This implies 
that the parallel processing grain is a one-year simulation. However, this approach 
does not scale well with the number of processors and the number of simulated 
years for convergence [6]. 

A more scalable approach to parallelise the problem is to analyse each year in 
parallel by allocating parts of a one year simulation to different processors. Since 
within a yearly synthetic sequence the adequacy analysis of the system operating 
states should be chronologically performed, this parallelisation strategy requires 
a careful analysis of the problem and a more complex solution. 

The generation of the yearly synthetic sequence is a strictly sequential process, 
since each state depends on the previous one. However, most of the computation 
time is not spent in the sequence generation itself, but in the simulation of 
the system adequacy at each state that makes up the sequence. Therefore, 
if all processors sample the same synthetic sequence, the adequacy analysis of 
parts of this sequence can be allocated to different processors and performed in 
parallel. Of course, some extra care must be taken in order to group the partial 
results of these sub-sequences and calculate the yearly reliability indices. 

In the methodology used in this paper, the whole synthetic sequence is di- 
vided into as many sub-sequences as the number of scheduled processors, the 
last processor getting the remainder if the division is not exact. Each processor 
is then responsible for analysing the states within a particular sub-sequence. In a 
master-slave model, at the end of each sub-sequence analysis, the slaves send to 
the master their partial results and start the simulation of another sub-sequence 
in the next year. The master is responsible for combining all these sub-sequence 
results sequentially in time and composing a complete year simulation. Since 
this methodology is asynchronous, the master has to keep track of the year to which 
each sub-sequence result it receives belongs, and has to accumulate the 
results for the right year. Each time it detects that a year has been completely 
analysed, it calculates the yearly reliability indices and verifies the convergence 
of the process. When convergence is achieved, the master sends a message to all 
slaves to stop the simulation, calculates the process reliability indices, generates 
reports and terminates execution. 
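
The division of a yearly sequence and the bookkeeping on the master side can be sketched in Python as below. This is a simplified serial illustration, not the MPI implementation used in the paper: `split_sequence` and `YearAccumulator` are hypothetical names, and message passing is replaced by direct method calls.

```python
def split_sequence(n_states, n_proc):
    """Return [start, end) index ranges, one per processor, in rank order."""
    base = n_states // n_proc
    ranges = []
    for rank in range(n_proc):
        start = rank * base
        # The last processor also takes the states left over by the division.
        end = n_states if rank == n_proc - 1 else start + base
        ranges.append((start, end))
    return ranges

class YearAccumulator:
    """Master-side bookkeeping: partial results may arrive out of year order."""

    def __init__(self, n_proc):
        self.n_proc = n_proc
        self.partial = {}                 # year -> {rank: partial result}

    def receive(self, year, rank, value):
        """Store one partial result; return the ordered list once the year is complete."""
        parts = self.partial.setdefault(year, {})
        parts[rank] = value
        if len(parts) == self.n_proc:
            del self.partial[year]
            return [parts[r] for r in range(self.n_proc)]
        return None

ranges = split_sequence(n_states=103, n_proc=4)

acc = YearAccumulator(n_proc=2)
first = acc.receive(year=1, rank=0, value="a")
early = acc.receive(year=2, rank=1, value="d")    # a result for year 2 arrives early
done = acc.receive(year=1, rank=1, value="b")     # year 1 is now complete
```

Once `receive` returns a complete list, the master can combine the sub-sequence results in chronological order, compute the yearly indices and check convergence.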

The methodology precedence graph is shown in Fig. 2. Each processor has a 
rank in the parallel computation which varies from 0 to (p-1), 0 referring to the 
master process. The basic tasks involved are: I - Initialisation, A - Sub-sequence 
States Analysis, R - Reception and Control of Sub-sequences, C - Convergence 
Control, S - Individual States Analysis and F - Finalisation. A superindex k,i 
associated with a task means it is relative to the i-th sub-sequence within year k 
and a superindex k,i,j means it is relative to the j-th state within sub-sequence 
i in year k. 

Fig. 2. Precedence Graph 

Since sequential MCS simulates a chronological evolutionary process, a system 
state configuration is dependent on the topological evolution of the previous 
states. The adequacy analysis of a state will determine if it is an up or down 
state, and adequate procedures must be taken depending on the kind of transition 
that led to that state (up-up, up-down, down-up, down-down). In order 
to identify the transition that occurs between sub-sequences analysed in parallel 
on different processors, the last state of each sub-sequence is analysed on two 
processors: the one responsible for that sub-sequence and the one responsible for 
the next sub-sequence. This permits knowledge of whether the first state of a 
sub-sequence comes from an up or down state, allowing the proper action to be 
taken. This solution is illustrated in Fig. 3 for 4 processors. 




Fig. 3. Consecutive Sub-Sequences Transition 



A very important problem that must be treated when dividing a synthetic 
sequence into sub-sequences to be analysed in parallel is whether a border 
falls within a failure sub-sequence of the whole synthetic sequence. A failure 
sub-sequence is a sequence of failure states, which corresponds to an energy 
supply interruption of duration equal to the sum of the individual failure state 
durations. If this coincidence occurs and is not properly treated, the duration 
related indices and their distribution along the months are wrongly evaluated 
because the failure sub-sequence is not completely detected and evaluated on 
the same processor. To solve this problem, the methodology forces a failure 
sub-sequence to be completely evaluated on the processor on which the first failure 
state of the sub-sequence occurs, as illustrated in Fig. 4. 

Fig. 4. Failure Sub-sequences Treatment 



If the last state of a sub-sequence allocated to a processor is a failure state, which 
means that the simulation is within a failure sub-sequence, that processor carries 
on analysing the next states until a success state is reached, ensuring that the 
whole failure is analysed on it. In a similar way, if the state before the first state 
of a sub-sequence allocated to a processor is a failure state, that processor 
skips all states until a success state is reached and starts to accumulate the 
results from this state on. This guarantees the correct evaluation and distribution 
of the duration related indices, but adds some computation overhead to 
the overall simulation. However, some extra cost must usually be paid when 
parallelising any sequentially constrained application. 
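
The border rule can be checked with a small Python sketch. It is an illustration under simplifying assumptions, not the authors' code: each state is reduced to a pair (is_failure, duration in hours), the adequacy analysis is assumed to be already done, and only the interruption durations are accumulated.

```python
def failure_runs_serial(states):
    """Durations of maximal runs of failure states; states = [(is_failure, hours)]."""
    runs, current = [], 0.0
    for is_failure, hours in states:
        if is_failure:
            current += hours
        elif current > 0.0:
            runs.append(current)
            current = 0.0
    if current > 0.0:
        runs.append(current)
    return runs

def failure_runs_on_processor(states, start, end):
    """Runs credited to the processor owning states[start:end]."""
    i = start
    # Skip a failure run that began in the previous sub-sequence.
    if start > 0 and states[start - 1][0]:
        while i < len(states) and states[i][0]:
            i += 1
    runs, current = [], 0.0
    # Keep going past `end` while a failure run that started here is still open.
    while i < len(states) and (i < end or current > 0.0):
        is_failure, hours = states[i]
        if is_failure:
            current += hours
        elif current > 0.0:
            runs.append(current)
            current = 0.0
        i += 1
    if current > 0.0:
        runs.append(current)
    return runs

# Toy yearly sequence: (is_failure, duration in hours), durations assumed positive.
states = [(False, 5.0), (True, 2.0), (True, 3.0), (False, 4.0), (True, 1.0),
          (True, 1.0), (True, 2.0), (False, 6.0), (True, 7.0), (False, 1.0)]
ranges = [(0, 2), (2, 5), (5, 7), (7, 10)]     # borders cut two failure runs
per_processor = [failure_runs_on_processor(states, s, e) for s, e in ranges]
combined = [run for runs in per_processor for run in runs]
```

In this toy sequence the third processor is credited with no interruption at all, because every state of its sub-sequence belongs to a failure run that started on the second processor.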



4 Results 

This work was implemented on three different multicomputer platforms. 

1. A scalable distributed memory parallel computer IBM RS/6000 SP composed 
of 10 POWER2 processors interconnected by a high performance 
switch of 40 MBps full-duplex bandwidth and 50 μsec latency; 

2. A network of workstations (NOW) composed of 8 IBM RS/6000 43P 
workstations interconnected by an Ethernet (10Base-T) network. The peak 
bandwidth of this network is 10 Mbps unidirectional; 

3. A PC cluster composed of 8 Pentium III 500MHz personal microcomputers 
interconnected by a Fast-Ethernet (100Base-T) network via a 12-port 100 
Mbps switch. Each PC has 128 MB RAM and a 6.0 GB IDE UDMA hard disk. 
The peak bandwidth and latency of the network are 100 Mbps unidirectional 
and 500 μsec, respectively. The operating system running over the network 
is Windows NT 4.0. 

The message passing system used on the first and second platforms is the 
MPI implementation developed by IBM for the AIX operating system. The 
PC cluster uses WMPI v1.2 [7], a freeware MPI implementation 
developed at Coimbra University, Portugal, for Win32 platforms. It is based 
on MPICH 1.1.2 and uses the ch_p4 device developed at Argonne National 
Laboratory (ANL) [8]. The implementations used on the three platforms comply 
with MPI standard version 1.1 [9]. 

Three different electric systems were used as tests to verify the performance 
and scalability of the parallel implementations. The first one is a representation 
of the New Brunswick power system (NBS) proposed by CIGRE as a standard 
for reliability evaluations [10]. This system has 89 nodes, 126 circuits and 4 
control areas. The second and third systems are representations of the Brazilian 
power system, with actual electric characteristics and dimensions, for Southern 
region (BSO) and Southeastern region (BSE), respectively. These systems have 
660 buses, 1072 circuits and 78 generators and 1389 buses, 2295 circuits and 
259 generators, respectively. A convergence tolerance of 5% in the coefficient of 
variation of the EPNS index was adopted in all simulations. 

The parallel efficiency obtained on 4, 6 and 10 processors of the IBM RS/6000 
SP parallel computer for the test systems, together with the CPU time of the 
mono-processor execution, is summarised in Table 1. 



Table 1. RS/6000 SP Results 



System   CPU time (p=1)   Efficiency (%)
                          p=4      p=6      p=10
NBS      30.17 min        97.32    94.64    84.62
BSO      13.02 min        93.57    93.27    82.23
BSE      15.30 hour       97.81    97.41    91.20



The application of the parallel methodology produces significant reduction 
of the simulation time required for reliability evaluation using sequential MCS. 
The efficiencies are very good, exceeding 82% for all test systems on 10 nodes. 
The methodology is scalable, with the number of simulated years required for 
convergence varying from 13 years for the BSO system to 356 for the NBS system, 
and also with the number of allocated processors. 
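
The efficiencies in Table 1 can be turned back into speedups and estimated parallel run times with a short calculation. The sketch below assumes the usual definitions, speedup = T(1)/T(p) and efficiency = speedup/p, which the text does not spell out; the function names are illustrative, and the derived parallel times are estimates rather than figures reported by the authors.

```python
def speedup(efficiency_percent, processors):
    """Speedup implied by a parallel efficiency given in percent."""
    return efficiency_percent / 100.0 * processors


def parallel_time(sequential_time, efficiency_percent, processors):
    """Estimated run time on `processors` nodes, in the units of sequential_time."""
    return sequential_time / speedup(efficiency_percent, processors)


# (sequential CPU time in minutes, efficiency in % on 10 processors), Table 1.
table1_p10 = {
    "NBS": (30.17, 84.62),
    "BSO": (13.02, 82.23),
    "BSE": (15.30 * 60.0, 91.20),
}

for name, (t1_minutes, eff) in table1_p10.items():
    s = speedup(eff, 10)
    tp = parallel_time(t1_minutes, eff, 10)
    print(f"{name}: speedup {s:.2f}, estimated time {tp:.2f} min")
```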

The parallel efficiency obtained on 4, 6 and 8 workstations of the NOW and 
the CPU execution time on one workstation are summarised in Table 2. 



Table 2. NOW Results 



System   CPU time (p=1)   Efficiency (%)
                          p=4      p=6      p=8
NBS      14.43 min        87.96    80.21    73.11
BSO      6.18 min         87.26    72.29    65.36
BSE      7.08 hour        92.90    91.57    90.60

The analysis of the efficiency achieved on the NOW shows the higher 
communication cost of an Ethernet network in comparison with the high performance 
switch of the RS/6000 SP. Most of the communication time on this platform 
is spent on the initial broadcast of the problem data, which is a consequence 
of two characteristics: first, the smaller bandwidth of this network and, second, 
the fact that the MPI broadcast is a blocking operation implemented as a 
sequence of point-to-point communications. In an Ethernet bus topology 
this costs more than in a multinode interconnected switch. The scalability can 
still be considered good, especially for the larger system. 

The parallel efficiency obtained on 4, 6 and 8 PCs of the cluster and the CPU 
execution time on one single PC are summarised in Table 3. 



Table 3. PC Cluster Results 



System   CPU time (p=1)   Efficiency (%)
                          p=4      p=6      p=8
NBS      5.37 min                  85.33    81.80
BSO      2.06 min         92.37    85.22    75.28
BSE      2.28 hour        98.64    98.11    96.25



The results achieved on this platform can be considered excellent, particularly 
when taking into consideration the low cost, ease of use and high availability of 
the computing environment. The sequential simulation time is already smaller 
than for the other platforms as a consequence of the more modern and powerful 
processor used. The efficiency of the parallel solution can be considered very 
good, exceeding 96% for the larger and more time consuming test system on 
8 PCs. The parallel results show higher efficiency than the NOW ones mostly 
due to the higher bandwidth of the network and the use of a 100 Mbps switch 
in a star topology. As a consequence, the scalability of the methodology is less 
affected by the increase in the number of processors. 

The speedup curves of the parallel methodology are shown in Figures 5, 6 
and 7 for the RS/6000 SP, NOW and PC Cluster platforms, respectively. 



5 Conclusions 

The power system composite reliability evaluation, using the sequential MCS 
approach, simulates a realisation of the stochastic process of system operation. 
Power supply interruption duration and the probability distribution of duration 
related indices can be calculated, which is not possible using the non-sequential 
MCS approach. These issues are fundamental in production cost studies that 
are receiving more and more attention in the new competitive environment of 
power system markets. However, the major drawback of sequential MCS 
application is the high elapsed computation time required on conventional platforms 
for large system models. This paper presented a parallel 
methodology for solving this problem, implemented on three different multicomputer 
platforms. The good results obtained in this work show that the computational 
cost incurred by parallelisation of a sequentially constrained problem is fairly 
compensated by the overall reduction of the simulation time. Even in the cases 
where efficiency is not that high, the engineering production time saved by the use 
of the parallel methodology justifies its use. Moreover, low cost platforms like a NOW 
or a cluster of PCs, which are usually available at any scientific institution and 
at most electric utilities, support the adoption of parallel processing as a reliable 
and economic computing environment. 

Fig. 5. RS/6000 SP Speedup Curve 

Fig. 6. NOW Speedup Curve 

Fig. 7. PC Cluster Speedup Curve 



References 

1. R. Billinton and W. Li, Reliability Assessment of Electric Power Systems Using 
Monte Carlo Methods, Plenum Press, New York, 1994. 

2. M.V.F. Pereira and N.J. Balu, “Composite Generation / Transmission Reliability 
Evaluation”, Proceedings of the IEEE, vol. 80, no. 4, pp. 470-491, April 1992. 

3. R. Billinton, A. Jonnavithula, “Application of Sequential Monte Carlo Simulation 
to Evaluation of Distributions of Composite System Indices”, IEE Proceedings - 
Generation, Transmission and Distribution, vol. 144, no. 2, pp. 87-90, March 1997. 

4. D.M. Falcao, “High Performance Computing in Power System Applications”, 
Lecture Notes in Computer Science, Springer-Verlag, vol. 1215, pp. 1-23, February 
1997. 

5. C.L.T. Borges and D.M. Falcao, “A Parallelisation Strategy for Power Systems 
Composite Reliability Evaluation (Best Student Paper Award: Honourable 
Mention)”, Lecture Notes in Computer Science, Springer-Verlag, vol. 1573, pp. 640-651, 
1999. 

6. C.L.T. Borges, “Power Systems Composite Reliability Evaluation on Parallel 
and Distributed Processing Environments”, PhD Thesis, in Portuguese, 
COPPE/UFRJ, Brazil, December 1998. 

7. J.M. Marinho, “WMPI v1.2”, http://dsg.dei.uc.pt/wmpi, Coimbra University, 
Portugal. 

8. R. Butler, E. Lusk, “User’s Guide to the p4 Parallel Programming System”, 
ANL-92/17, Mathematics and Computer Science Division, Argonne National 
Laboratory, October 1992. 

9. M. Snir, S. Otto, S. Huss-Lederman, D. Walker and J. Dongarra, “MPI: The 
Complete Reference”, The MIT Press, Cambridge, Massachusetts, 1996. 

10. CIGRE Task Force 38-03-10, “Power System Reliability Analysis - Volume 2 - 
Composite Power Reliability Evaluation”, 1992. 




A Novel Algorithm for the Numerical Simulation of 
Collision-Free Plasma: Vlasov Hybrid Simulation 



David Nunn 

Department of Electronics and Computer Science, 
Southampton University, 
Southampton, Hants, SO17 1BJ, UK. 



Abstract. Numerical simulation of collision-free plasma is of great importance 
in the fields of space physics, solar and radio physics, and in confined plasmas 
used in nuclear fusion. This work describes a novel, completely general and 
highly efficient algorithm for the numerical simulation of collision-free plasma. 
The algorithm is termed Vlasov Hybrid Simulation (VHS) and uses simulation 
particles to construct the particle distribution function in the region of phase (r,v) 
space of interest. The algorithm is extremely efficient and far superior to the 
classic particle in cell method. A fully vectorised and parallelised VHS code has 
been developed, and has been successfully applied to the problem of the 
generation of VLF triggered emissions and VLF 'dawn chorus', due to the 
nonlinear interaction of cyclotron resonant electrons with narrow band VLF 
waves (~kHz) in the earth's magnetosphere. 



1 Introduction 

The problem of the numerical simulation of plasma is one of great importance in the 
realms of both science and engineering. The physics of the solar corona is essentially 
that of a very hot collision free plasma. Plasma physics governs the behaviour of radio 
waves in the whole of the earth's near space region, usually termed the 
'magnetosphere'. Closer to home, plasmas employed in nuclear fusion devices and 
industrial plasmas may well have time and spatial scales which make them effectively 
collision-free, and understanding their dynamics is of vital importance. 

The equations governing any collision free (CF) plasma physics problem are those of 
Maxwell and Liouville. Liouville's theorem states that the density of particles F(r,v) 
in 6 dimensional phase space r,v is conserved following the trajectories of particles in 
phase space. Clearly plasma physics problems may be immensely complicated, 
particularly if particle motion is non linear. Usually one must resort to numerical 
simulation to gain any comprehension of what is happening. 

Traditionally the methodology of choice for collision free plasma simulation was the 
classic particle-in-cell (PIC) method. The required spatial domain r or simulation box 
is covered by a suitable grid. A large number of simulation particles (SP's) are 
inserted into the simulation box and their trajectories followed according to the usual 
equations of motion. At each time step particle charge/currents are assigned or 
distributed to the immediately adjacent spatial grid points, thus giving the 
charge/current field in the box. Use of the discretised Maxwell's equations allows one 
to time advance or 'push' the electric and magnetic field vectors in the r domain. PIC 
codes however suffer from several disadvantages. They are noisy, make inefficient 
use of simulation particles, and do not properly resolve distribution function in phase 
space. For problems involving small amplitude waves where the perturbation in 
distribution function dF is relatively small (dF << F0) they are particularly noisy and 
inefficient. 



2 The Vlasov Hybrid Simulation Method (VHS) 

A novel and highly efficient simulation method has been devised termed Vlasov 
Hybrid Simulation (VHS) [1]. The structure of the algorithm is as follows. A phase 
space (r,v) simulation box is first selected, to cover the domain of interest in the 
problem at hand. The maximum dimensionality of phase space is 6, but many realistic 
simulations have a reduced number of spatial or velocity space dimensions. The phase 
box may be a function of time as the simulation progresses. In the present case for 
example we are interested in electrons that are cyclotron resonant with the wave field 
and this phase box will cover the region of velocity space that is close to the 
resonance velocity. The box is filled with a grid to provide adequate resolution of 
distribution function in phase space. At the start of the simulation the phase box is 
evenly filled with particles at a density of about 1-2 per elementary grid cell. By 
Liouville's theorem distribution function F is conserved along phase trajectories. Each 
Simulation Particle (SP) is assigned a value of F appropriate to the initial conditions 
for the problem at hand. As the simulation progresses the SP trajectories in phase 
space are numerically integrated, in this case using a second order modified Euler 
algorithm. Thus the value of distribution function (F) is known at the points in phase 
space where the simulation particles happen to be located. Now at each time step the 
values of F at SP points are interpolated to the fixed phase space grid. This is 
achievable by a very simple procedure. The value of distribution function $F_l$ at each 
simulation particle number $l$ is distributed additively to adjacent grid points using the 
familiar area weighting coefficients $\alpha_l$ as employed in classical PIC codes. The 
weighting coefficients $\alpha_l$ themselves are also distributed additively to adjacent grid 
points. For a specific grid point $ijk$ we then have 

$$F_{ijk} = \frac{\sum_l \alpha_l F_l}{\sum_l \alpha_l}$$

where the sum is over all simulation particles located in the $2^n$ elementary hypercubes 
surrounding the grid point in question, where a phase space of dimensionality $n$ has 
been assumed. This interpolation procedure lies at the heart of the VHS method and 
confers its many advantages. Once distribution function $F_{ijk}$ is defined on a regular 
velocity space grid it is a simple matter to compute plasma current and charge fields 
in 3D cartesian space, by appropriate integration (summation) over the velocity space 
grid. Following this one may push the EM fields forwards in time using a discretised 
representation of Maxwell's equations. 
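
The interpolation step can be illustrated with a minimal one-dimensional sketch in Python. It is not the author's Fortran code: the function name, the uniform grid with spacing dx and the use of linear (area) weights to the two neighbouring grid points are assumptions made for the example. Each particle adds its weighted value of F and its weights to the adjacent grid points, and the grid value is the ratio of the two sums.

```python
import math


def interpolate_to_grid(positions, values, n_grid, dx=1.0):
    """Interpolate F carried by particles onto a uniform 1D grid.

    Grid point j sits at j * dx. Returns a list with one entry per grid
    point; a point that no particle touches is returned as None.
    """
    weighted_sum = [0.0] * n_grid   # sum of alpha_l * F_l per grid point
    weight_sum = [0.0] * n_grid     # sum of alpha_l per grid point
    for x, f in zip(positions, values):
        s = x / dx
        i = int(math.floor(s))
        w_right = s - i             # linear weight of the right neighbour
        for j, w in ((i, 1.0 - w_right), (i + 1, w_right)):
            if 0 <= j < n_grid and w > 0.0:
                weighted_sum[j] += w * f
                weight_sum[j] += w
    return [ws / w if w > 0.0 else None
            for ws, w in zip(weighted_sum, weight_sum)]


# Two particles halfway between grid points, carrying F = 2 and F = 4.
grid_values = interpolate_to_grid([0.5, 1.5], [2.0, 4.0], n_grid=3)
print(grid_values)
```

Because the weights are normalised at every grid point, a uniform F is reproduced exactly whatever the local particle density, which is one way to see why the density of simulation particles is not a critical quantity in this scheme.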

2.1 Particle Control 

Fortunately from Liouville's theorem itself there is no tendency for SP's to bunch and 
leave grid points 'uncovered'. Where this does occur a value for $F_{ijk}$ may be secured by 
interpolation from neighbouring grid points. At any time extra SP's may be inserted 
into (or removed from) the phase fluid; they only act as markers providing information 
about the value of distribution at a particular point. Unlike all other techniques the 
density of SP's is not a critical quantity. It only needs to remain at a value greater than 
~1 per elementary phase space volume. For some problems, and this is particularly 
true in the present case, there will be a flux of phase fluid out of or into the simulation 
phase space box along its boundaries. Particles leaving the phase box are discarded as 
they convey information no longer required. Where phase fluid enters the box it is 
necessary to insert new SP's into the phase fluid at that point. This has to be done with 
some care in order to attain an acceptable density of simulation particles in the 
incoming phase fluid. It is the interpolation procedure that makes it legitimate and 
possible to do this. This is a very powerful feature of VHS. The population of 
simulation particles is dynamic and constantly changing. 
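
A simplified picture of this bookkeeping, in Python, is given below. It is a sketch under several assumptions that are not taken from the paper: a one-dimensional velocity box, particles stored as (v, F) tuples, new particles placed at the centre of any empty cell (the paper inserts them where phase fluid enters the box), and a caller-supplied function f_inflow, a hypothetical name, that provides the value of F for a newly inserted particle.

```python
def control_particles(particles, v_min, v_max, n_cells, f_inflow):
    """Discard particles outside [v_min, v_max) and refill empty cells.

    particles: list of (v, F) tuples. f_inflow(v) supplies F for a new
    particle inserted at velocity v (assumed to be known to the caller).
    """
    width = (v_max - v_min) / n_cells
    # Particles that left the phase box carry information no longer needed.
    kept = [(v, f) for v, f in particles if v_min <= v < v_max]
    counts = [0] * n_cells
    for v, _ in kept:
        cell = min(int((v - v_min) / width), n_cells - 1)
        counts[cell] += 1
    # Keep the density at one particle or more per cell.
    for cell, count in enumerate(counts):
        if count == 0:
            v_new = v_min + (cell + 0.5) * width
            kept.append((v_new, f_inflow(v_new)))
    return kept


population = [(-0.5, 1.0), (0.25, 2.0), (1.5, 3.0)]
population = control_particles(population, 0.0, 1.0, 2, lambda v: 9.0)
print(population)
```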

2.2 Advantages of VHS 

VHS has been found to be highly efficient and to have very low noise levels when 
compared to PIC codes. Very efficient use is made of the simulation particles, as they 
carry information as to the value of F (or rather dF). Unlike other Vlasov simulation 
techniques that have been developed the algorithm is very stable and robust. For 
example the standard method of Cheng and Knorr [2] aims to solve numerically the 
Vlasov equations in phase or configuration space. This requires the determination of 
the gradient of distribution function in phase space. This presents severe practical 
problems. In many plasma simulation problems particle distribution function acquires 
quite legitimately fine structure in phase space, often termed 'filamentation'. For 
example this may arise in wave particle interaction problems in plasma when particles 
become phase trapped in a narrow band wave. Such filamentation in velocity space 
makes the Cheng and Knorr algorithm numerically unstable. 
Attempts to resolve this problem involve techniques such as numerical smoothing, 
which corrupts the underlying physics being simulated. Another feature of VHS is 
that for certain problems, one may limit the region of phase space where F is resolved 
to a time varying simulation box. This is indeed the case in the present problem. The 
ability to accommodate a flux of phase fluid across the boundary of the phase box is 
unique to VHS and allows the particle population to be dynamic and to change 
constantly. In this way the particle population is confined to a set that is locally 
optimal in time. For example in the present problem particles are constantly drifting 
into and out of resonance with the wave. A PIC code would end up following large 
numbers of non resonant particles, but a VHS code will constantly discard non 
resonant particles and continually introduce new resonant particles. The benefits in 
computational time this confers cannot be overestimated. 

Another virtue of the VHS method is that distribution function is properly resolved 
in phase space and is available as a diagnostic output. Distribution function is only 
available from a PIC code by numerically inspecting the density of (weighted) 
simulation particles in velocity space. This is actually rarely done with PIC codes, and 
if it were one would quickly realise that the density of SP's was grossly inadequate to 
define F, let alone dF. It is a fact that PIC codes, particularly in applications with high 
dimensionality, often have inadequate numbers of simulation particles. The noise 
level is then extremely high, and the authors are relying on integration over time and 
over velocity space (in the evaluation of J ( r ) and p( r ) ) to reduce the noise to 
manageable levels. 



3 The Application Area 

The VHS algorithm has been fruitfully applied here to a classic problem in space 
plasma physics. This is the generation mechanism of triggered emissions and chorus 
in the VLF band (3-30kHz) in the earth's magnetosphere. Triggered emissions are 
narrow band signals with sweeping frequency. Typically the frequency may rise or 
fall by several kHz in a time ~1-2 secs. More complex spectral forms are often 
observed, such as downward hooks, upwards hooks, quasi constant tones and 
emissions whose frequency oscillates. Emissions are generated as a result of nonlinear 
cyclotron resonant interaction between the EM wave and energetic radiation belt 
electrons of ~keV energy. Emissions achieve quite strong amplitudes of B'=2-10pT, 
which represents a wave strong enough to nonlinearly 'trap' cyclotron resonant 
electrons. It is generally agreed that chorus and VLF emission arise in 'ducts' where 
the wave vector is closely parallel to the ambient magnetic field direction. A key 
aspect of the nonlinear wave particle interaction is the dominant role of the magnetic 
field inhomogeneity, which controls particle trapping dynamics and confines the 
interaction region to the equatorial zone. Consequently we have developed a 
VHS/VLF code with 1 spatial dimension and 3 velocity dimensions to simulate this 
nonlinear self consistent interaction in the equatorial zone of the earth's 
magnetosphere. The region of generation is typically between 3 and 10 earth radii in 
altitude and some thousands of km in extent, spread along equatorial magnetic field 
lines. This problem is extremely well suited to the VHS method; indeed this 
simulation has not been successfully achieved with any other type of simulation 
method, and PIC codes have shown themselves to be quite incapable of simulating 
this phenomenon. The phase box encloses the cyclotron resonance velocity $V_{res}$ 

$$V_{res} = (\omega - \Omega)/k$$

where $\omega$ is wave frequency, $\Omega$ is electron gyrofrequency and $k$ is wave number. Note 
that resonance velocity is in the opposite direction to wave phase and group 
velocities. The resonance velocity will vary in both space and time, through changing 
frequency of the emission, and significantly through inhomogeneity of the ambient 
magnetic field, which has a parabolic dependence on distance z from the equator. 
Thus particles are constantly entering the phase box, which is the region of resonance 
and thus of direct physical interest. It is thus guaranteed that all SP's are close to 
resonance. 
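
How the resonance velocity moves with distance from the equator can be illustrated with a few lines of Python. The numbers are dimensionless and purely illustrative, the parabolic coefficient a is a hypothetical parameter, and the wave number is held constant for simplicity even though it also varies along the field line in a real magnetosphere.

```python
def gyrofrequency(z, gyro_equator, a):
    """Parabolic field model: Omega(z) = Omega_eq * (1 + a * z**2)."""
    return gyro_equator * (1.0 + a * z * z)


def resonance_velocity(omega, gyro, k):
    """Cyclotron resonance velocity (omega - Omega) / k."""
    return (omega - gyro) / k


# Illustrative dimensionless values with omega below the gyrofrequency.
omega, k, gyro_equator, a = 2.0, 1.5, 5.0, 0.1

for z in (0.0, 1.0, 2.0):
    v_res = resonance_velocity(omega, gyrofrequency(z, gyro_equator, a), k)
    print(f"z = {z:.1f}  V_res = {v_res:.3f}")
```

For a wave frequency below the gyrofrequency the result is negative, that is, opposite to the wave propagation direction, and its magnitude grows away from the equator, so the phase box has to follow it in both space and time.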



4 The VHS/VLF Code 



The code has been developed in Fortran 77 and has been run on a wide variety of 
platforms, namely Origin2000, DEC Alpha cluster, Convex Exemplar, Cray YMP etc. 
The most numerically intensive procedures are the particle push routines, and the 
process of interpolating distribution function from particles to the fixed grid. The 
particle push routines fully vectorize, but the interpolation procedure does not due to 
its logical complexity. The whole code has been parallelised using MPI, which has 
been easily achieved by means of the following technique. The 1D spatial domain is 
divided into M adjacent blocks, where M is the number of available processors. Each 
processor implements the particle push and interpolation procedures in its part of the 
spatial grid. At each step, those particles which physically move from one spatial 
domain to the next must be passed with their appertaining data between adjacent 
processors at the interface. The field push equations and certain global operations 
such as FFT/IFFT filtering of the EM wave fields are low work load operations and 
are performed by the master processor. All processors must pass current field data to 
the master at each timestep, where field push and field filtering are performed. The 
master then returns the new global EM wave fields to the 'slaves' who then perform 
the particle push and distribution function interpolation for the next timestep. 
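
The particle hand-over between spatial blocks can be sketched without MPI as follows. The Python below only mimics the bookkeeping on a single process: each inner list stands for the particles held by one processor, particles are (z, F) tuples, and after a push they are re-assigned to the block that now contains them. The function names are illustrative, and a real implementation would send the migrating particles to the neighbouring processor instead of re-binning them globally.

```python
def block_index(z, z_min, z_max, n_blocks):
    """Index of the spatial block (processor) that owns position z."""
    width = (z_max - z_min) / n_blocks
    return min(int((z - z_min) / width), n_blocks - 1)


def exchange_particles(blocks, z_min, z_max):
    """Re-assign particles to blocks after a push.

    blocks[m] is the list of (z, F) particles held by processor m.
    Particles that left the spatial domain [z_min, z_max) are dropped.
    """
    n_blocks = len(blocks)
    new_blocks = [[] for _ in range(n_blocks)]
    for particles in blocks:
        for z, f in particles:
            if z_min <= z < z_max:
                owner = block_index(z, z_min, z_max, n_blocks)
                new_blocks[owner].append((z, f))
    return new_blocks


# Two blocks covering [0, 1) and [1, 2); positions are those after a push.
after_push = [[(0.6, 1.0), (1.1, 2.0)], [(1.3, 3.0), (2.4, 4.0)]]
redistributed = exchange_particles(after_push, 0.0, 2.0)
print(redistributed)
```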

The simulation takes place within a finite frequency band located about a centre 
frequency which itself may be a function of time. The simulation bandwidth is ~ 
?0 Hz, which requires a spatial grid of ~1600 points in order to resolve all Fourier components 
of the wave spectrum. The velocity space grid must be dense enough to resolve the 
structure of the distribution function in the region about the resonance velocity. The 
dominant structure is the so called 'resonant particle trap' and it was found that having 
50 grid points in the $v_z$ axis parallel to the $B_0$ direction and 20 points in gyrophase 
gave adequate resolution. The total number of phase space grid points and thus the 
number of simulation particles is thus typically in the range 0.5-5 million. A short run 
may take only a few hours on an Origin2000. However run time scales as bandwidth 
cubed, so high bandwidth runs may take as long as a week. 



5 The Observational Data 

Radio emissions in the VLF (kHz) band, the so called VLF emissions, may occur 
spontaneously or be obviously triggered by some other signal. The first observations 
of triggered VLF emissions were obtained on the earth's surface on US Navy vessels. 
Morse code signals at 14kHz from the high power VLF transmitter NAA at Cutler, 
Maine were observed to 'trigger' long enduring radio emissions (~1 second) with a 
sweeping frequency ~2kHz/sec [3]. In pioneering research it was realised by 
Helliwell [3] that these emissions must arise in the earth's magnetosphere and be due 
to non linear electron cyclotron resonance with radiation belt electrons with energies 
~keV. Since that time triggered VLF emissions have been routinely observed on the 
ground, particularly at Halley Bay, Antarctica [4] and in Northern Scandinavia [5]. In 
the 1970's Stanford University established a horizontal VLF antenna in Antarctica at 
Siple station on the South Polar plateau [6]. An extensive program of VLF 
transmissions was made to probe the magnetosphere and investigate the phenomenon 
of triggered emissions. One of the main objectives of the research program described 
here has been to develop the theory and numerical simulation tools to fully understand 
the many extraordinary phenomena observed in the Siple data base. 

Since triggered emissions are generated in space, it is not surprising that this 
phenomenon has also been observed on board scientific satellites. Unfortunately the 
VLF radio waves are confined to field aligned ducts caused by localised 
enhancements of plasma density. These ducts are ~ 100km in extent and it is only 
infrequently that a satellite will pass through a duct. Consequently satellite 
observations can be rather disappointing. However at large distances from the earth, 
~10 earth radii, VLF signals are not ducted and satellites there record a variety of 
VLF chorus and triggered emissions. A recent paper by Nunn et al (1997) [7] presents 
VLF emission observations from the Geotail satellite and uses the VHS simulation 
code to produce almost exact replicas of emissions observed, using all the field and 
particle observations from on board the satellite. These results confirmed totally the 
plasma theory underlying this phenomenon. 



Fig. 1. Frequency/time plot of VLF emission 



6 Numerical Modelling of Siple Triggered Emissions 

The VHS/VLF code has been used to successfully simulate the triggering of a rising 
frequency emission triggered by a CW 70ms pulse at 3663Hz from the Siple 
transmitter. Figure 1 above displays a frequency-time contour plot of the output 
wavefield sequence as recorded at the end of the simulation box. The sweep rate of 
1 kHz/s is in excellent agreement with observations on the ground and on board 
satellites. 

The emission itself is produced by a quasi static non linear self consistent and self 
maintaining structure termed a VLF soliton or generating region. This soliton is stable 
in nature, both in reality and in the simulation code. The profile of the riser soliton is 
shown in figure 1. The code has completely elucidated the dynamical structure of the 
VLF soliton, and identified two distinct types, one associated with a riser and one 
with a faller. 

The code is also able to reproduce fallers with a suitable choice of initial parameters. 
Both downward and upward hooks may be produced by the code and these are 
interpretable in terms of transitions between the two soliton types. The sweeping 
frequency is due to the out of phase component of resonant particle current that sets 
up spatial gradients of wave number in the wave field and is able to sustain these. The 
top panel of figure 2 shows d/dz(Ji/|R|), where Ji is the out of phase component of 
resonant particle current and R the complex field. This quantity is the 'driver' that sets 
up the appropriate wave number gradients. 



Fig. 2. Plot of d/dz(Ji/|R|) in Hz/s 



7 Conclusions 

The VHS method for numerical simulation of collision free plasma is low noise, 
highly efficient, very stable and provides excellent diagnostics. In this application the 
method has been successfully used to simulate triggered radio emissions in the VLF 
band in the earth's near space region. This is a complex and difficult problem which 
has never been solved using PIC codes. The VHS method far outperforms particle in 
cell codes in all applications where resolution of the distribution function in velocity 
space is required. The method is completely general and may be safely applied to 
ANY collision free plasma simulation problem. Problems with a high dimensionality 
will be expensive if tackled with a Vlasov VHS code. However use of a properly 
constituted Vlasov code guarantees accuracy and meaningful results. It is all too easy 
to run PIC codes with far too few particles, and to obtain results which although often 
plausible are in fact heavily corrupted by simulation noise. 



References 

1. Nunn, D.: A Novel Technique for the Numerical Simulation of Hot Collision Free 
Plasma - Vlasov Hybrid Simulation. J. of Computational Physics, vol 108(1) (1993) 
180-196. 

2. Cheng, C.Z., Knorr, G.: The Integration of the Vlasov Equation in Configuration Space. J. 
Geophysical Research, (1976), vol 95 15073 et seq. 

3. Helliwell, R.A.: Whistlers and Related Ionospheric Phenomena, Stanford University 
Press, Stanford, California, USA (1965). 

4. Smith, A.J. and Nunn, D.: A Numerical Simulation of VLF Risers, Fallers and Hooks 
observed in Antarctica, J. Geophysical Research, 103 (1998) 6771-6784. 

5. Nunn, D., Manninen, J., Turunen, T., Trakhtengerts, V., and Erokhin, N.: On the Nonlinear 
Triggering of VLF Emissions by Power Line Harmonic Radiation, Annales 
Geophysicae, 17 (1999) 79-94. 

6. Helliwell, R.A.: Controlled Stimulation of VLF Emissions from Siple Station, Antarctica, 
Radio Science, 18 (1983) 801-814. 

7. Nunn, D., Omura, Y., Matsumoto, H., Nagano, I., and Yagitani, S.: The Numerical Simulation 
of VLF Chorus and Discrete Emissions Observed on the Geotail Satellite Using a Vlasov 
Code, J. Geophysical Research, 102, A12 (1997) 27083-27097. 




An Efficient Parallel Algorithm for the 
Numerical Solution of Schrödinger Equation 



Jesús Vigo-Aguiar¹, Luis M. Quintales², and Srinivasan Natesan³ 

¹ Corresponding author. Dept. of Mathematical Sciences, 
University of Wisconsin-Milwaukee. 
PO Box 413, Milwaukee, WI, 53201, USA. 
jvigo@gugu.usal.es 

² Dept. of Informática, University of Salamanca. 
E-37008 Salamanca, Spain. 
lamq@gugu.usal.es 

³ Dept. of Mathematics, Bharathidasan University, 
Tiruchirappalli 620 024, Tamilnadu, INDIA. 
matnat@bdu.ernet.in 



Abstract. In this paper we show how to construct parallel explicit 
multistep algorithms for an accurate and efficient numerical integration of 
the radial Schrödinger equation. The proposed methods are adapted to 
Bessel functions, that is to say, they integrate exactly any linear 
combination of Bessel and Neumann functions and ordinary polynomials. They 
are the first methods of this kind that can achieve any order. The 
coefficients of the method are computed in each step. We show how the parallel 
implementation of the method is the key to an efficient computation. 



1 Introduction 

The behavior of a spinless quantum particle of mass $m$ in a potential $v(X)$, 
$X = (x_1, x_2, x_3)$, is governed by the three-dimensional Schrödinger equation 

$$-\frac{\hbar^2}{2m}\,\Delta y(X) + (v(X) - e)\,y(X) = 0 \qquad (1)$$

where $\Delta$ is the Laplace operator, $\hbar$ is the reduced Planck’s constant and $e$ is 
the particle energy. The solution $y(X)$ can be expanded on the complete set of 
spherical functions 



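$$y(X) = \sum_{l=0}^{\infty} \sum_{m=-l}^{l} \frac{y_l(x)}{x}\, Y_l^m(\theta, \varphi) \qquad (2)$$

where $x, \varphi, \theta$ are the spherical coordinates of the point $X$. Introducing this 
expansion in the equation and operating, we find that $y_l(x)$ satisfies 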

$$y_l''(x) = (U(x) - P(x))\, y_l(x), \qquad (3)$$

where 

$$P(x) = e - \frac{l(l+1)}{x^2} \qquad (4)$$

and $U(x)$ is a given potential. The solution of the equation must vanish at 
the origin, i.e. one boundary condition is $y_l(0) = 0$, and the other boundary 
condition, which depends on the physical model, is imposed at large $x$. 

Equation (3) is usually known as the radial Schrödinger equation. The problem 
of integrating (1) has thus been transformed into the integration of an infinite set of 
second order differential equations, so methods 
with small CPU times are clearly needed. The use of parallel procedures and adequate multistep 
methods allows fast and accurate integration. 

In the computation of the eigenvalues or the phase shifts of the radial
Schrödinger equation, the potential $U(x)$ usually tends to zero much faster than
the centrifugal potential $l(l+1)/x^2$ contained in $-P(x)$, and then the solution
of (3) may be taken as

$$ y(x) = c_1\,kx\,j_l(kx) + c_2\,kx\,n_l(kx) \tag{5} $$

where $c_1$ and $c_2$ are constants and $j_l(x)$ and $n_l(x)$ are respectively the
spherical Bessel and Neumann functions. It is our intention to develop a method
that integrates exactly any linear combination of these functions and ordinary
polynomials. This property is known as Bessel fitting or adaptation to Bessel
functions. The theory and a procedure for constructing multistep methods adapted
to trigonometric and exponential functions are by now well established and can
be found in [8]. A theory and procedure for adaptation to other types of dynamic
behavior is still an open question.

The difficulty of constructing methods adapted to Bessel functions is evidenced
by the fact that there exist only a few satisfactory papers on the subject
(see, for example, Raptis and Cash [3] and Simos and Raptis [5]). The methods
of Raptis and Cash produce accurate solutions in the phase shift problem that
they proposed, in spite of being second and fourth order methods. However, in
their methods the coefficients depend on the point where the solution is being
calculated, and so they must be recalculated at every step, at a high
computational cost. This is the point where parallel implementation is
fundamental. Our goal is to formulate higher order Bessel fitting methods whose
coefficients can be computed in parallel at the beginning of the program,
thus allowing a significant reduction in the computational cost.



2 Bessel and Neumann Fitting Methods 

To construct our procedure let us consider the differential equation

$$ y'' + P(x)\,y = f(x,y) \tag{6} $$

Our first observation is that the sequence
$y_n = c_1\,k\,nh\,j_l(nh) + c_2\,k\,nh\,n_l(nh)$ is the solution of the difference
equation

$$ \begin{aligned}
y_{n+1} + d_l(x_{n+1}, x_{n-1}, h)\,y_n + d_l(x_{n+1}, x_n, h)\,y_{n-1} &= 0 \\
y_0 &= 0 \\
y_1 &= c_1\,k\,h\,j_l(h) + c_2\,k\,h\,n_l(h)
\end{aligned} \tag{7} $$
where

$$ d_l(a,b,h) = \frac{\begin{Vmatrix} ka\,j_l(ka) & kb\,j_l(kb) \\ ka\,n_l(ka) & kb\,n_l(kb) \end{Vmatrix}}{\begin{Vmatrix} k(a-h)\,j_l(a-h) & k(a-2h)\,j_l(a-2h) \\ k(a-h)\,n_l(a-h) & k(a-2h)\,n_l(a-2h) \end{Vmatrix}} \tag{8} $$

and $\|\cdot\|$ denotes the determinant. Then the problem

$$ y'' + P(x)\,y = 0 \tag{9} $$

is integrated exactly with the proposed difference equation.
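As an illustration of how a three-term recurrence can annihilate two given functions, the following minimal Python sketch treats only the case $l = 0$, where $kx\,j_0(kx) = \sin(kx)$ and $kx\,n_0(kx) = -\cos(kx)$. The function name `fit_three_term` is hypothetical, and the coefficients are obtained from Cramer's rule applied directly to the two exactness conditions; the sign and argument conventions of the quotient $d_l$ in the text are not reproduced here.

```python
import math


def fit_three_term(k, x_next, h):
    """Return (c1, c2) such that u(x_next) + c1*u(x_next - h) + c2*u(x_next - 2h) = 0
    holds for both u(x) = sin(k x) and u(x) = cos(k x) (the l = 0 case).

    Hypothetical helper: solves the 2x2 linear system with Cramer's rule.
    Requires sin(k h) != 0 so that the system is nonsingular.
    """
    xs = (x_next, x_next - h, x_next - 2.0 * h)
    S = [math.sin(k * x) for x in xs]
    C = [math.cos(k * x) for x in xs]
    det = S[1] * C[2] - S[2] * C[1]
    c1 = -(S[0] * C[2] - S[2] * C[0]) / det
    c2 = -(S[1] * C[0] - S[0] * C[1]) / det
    return c1, c2


# Illustrative values (assumptions, not taken from the paper).
c1, c2 = fit_three_term(k=2.0, x_next=1.0, h=0.05)
print(c1, c2)
```

For $l = 0$ the result can be checked by hand: the addition formulas give $c_1 = -2\cos(kh)$ and $c_2 = 1$ independently of the grid point, whereas for $l > 0$ the coefficients depend on the point, which is why they have to be recomputed along the grid.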

The construction of the discretization scheme is completed with the treatment
of the right-hand side $f(x,y)$ in (6). In the theory of classical multistep
methods, $f(x,y)$ is approximated by an interpolatory polynomial through the
previous steps. The same procedure is followed here. The expression of the
Bessel fitted method applied to (6) is:



$$ y_{n+1} + d_l(x_{n+1}, x_{n-1}, h)\,y_n + d_l(x_{n+1}, x_n, h)\,y_{n-1} = h^2 \sum_{i=0}^{k} \alpha_i\, f(x_{n+1-i}, y_{n+1-i}) \tag{10} $$

We impose that the method integrates exactly the interpolation polynomial
of $f(x,y)$ by requiring the method to be exact when we integrate the equations

$$ y''(x) + P(x)\,y = P(x)\,x^m + m(m-1)\,x^{m-2} \tag{11} $$

for $m = 0, 1, \ldots, k$. With this condition we obtain the nonsingular system of
linear equations for the $\alpha_i$:

$$ A\alpha = Q \tag{12} $$

where A is 
(15)

The solution of this system of equations can be computed for $0 < k < 10$ with
the help of a symbolic manipulator. The resulting scheme will be named PSBF
(Parallel Spherical Bessel Fitted method) in the following. Note that the
coefficients $\alpha$ are recalculated once in each step. This is a characteristic
of all the methods that integrate exactly linear differential equations with
non-constant coefficients.
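The idea of fixing the coefficients by exactness conditions can be sketched in the constant-coefficient limit $P(x) = 0$, where the left-hand side becomes $y_{n+1} - 2y_n + y_{n-1}$. This is only an illustration under stated assumptions: in that limit the conditions for $m = 0, 1$ hold trivially, so the sketch imposes exactness for $y = x^m$ with $m = 2, 3, 4$ instead, on the nodes $x_{n+1}, x_n, x_{n-1} = 1, 0, -1$ with $h = 1$. The helper names `solve` and `numerov_coefficients` are hypothetical.

```python
from fractions import Fraction


def solve(A, b):
    """Gauss-Jordan elimination in exact rational arithmetic (square, nonsingular A)."""
    n = len(A)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        # Pick a row with a nonzero pivot and move it into place.
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]


def numerov_coefficients():
    """Coefficients a_0, a_1, a_2 of
    y_{n+1} - 2 y_n + y_{n-1} = h^2 (a_0 f_{n+1} + a_1 f_n + a_2 f_{n-1}),
    with f = y'', obtained by imposing exactness for y = x^m, m = 2, 3, 4."""
    nodes = [Fraction(1), Fraction(0), Fraction(-1)]  # x_{n+1}, x_n, x_{n-1}, h = 1
    A, b = [], []
    for m in (2, 3, 4):
        # Row: second derivative of x^m evaluated at each node.
        A.append([m * (m - 1) * x ** (m - 2) for x in nodes])
        # Right-hand side: the second difference of x^m.
        b.append(nodes[0] ** m - 2 * nodes[1] ** m + nodes[2] ** m)
    return solve(A, b)


print(numerov_coefficients())
```

The exact rational solution is $(1/12,\ 5/6,\ 1/12)$, the weights of the Numerov method. In the Bessel fitted case the same kind of linear system has entries that depend on $P(x)$ at the grid points, so it must be solved anew at each step.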

Our method is implicit; an explicit method could be derived in the same way.
However, when we apply our method to equation (3), where $f(x,y) = U(x)\,y$ is
linear in $y$, relation (10) can be solved for $y_{n+1}$ and we obtain an explicit
procedure:

$$ y_{n+1} = \left(1 - h^2 \alpha_0\, U(x_{n+1})\right)^{-1} \left( -d_l(x_{n+1}, x_{n-1}, h)\,y_n - d_l(x_{n+1}, x_n, h)\,y_{n-1} + h^2 \sum_{i=1}^{k} \alpha_i\, U(x_{n+1-i})\,y_{n+1-i} \right) \tag{16} $$

Given the good stability properties of implicit methods, we have considered
it unnecessary to use an explicit method.



3 Parallel Implementation and Properties 

We give a brief explanation of the convergence of the method (detailed proofs
will appear in a different paper).

Theorem. The multistep method PSBF of $k + 1$ steps is consistent of order
$k + 1$. Its local truncation error can be expressed as

$$ \mathcal{L}_P(y, h)(x) = h^{k+3}\,\alpha_{k+1}\,P(D)\,y(x) + O(h^{k+4}) \tag{17} $$

where $P(D)y(x)$ is a certain combination of $y(x)$ and its derivatives. The method
integrates without local truncation error the problems (3) whose solution belongs
to the space generated by the linear combinations of

$$ 1,\; x,\; x\,j_l(x),\; x\,n_l(x) \tag{18} $$

Observe that this method reduces to the classical Cowell method (Henrici
1962) when $P(x) = 0$. For $k = 2$ the method reduces to the popular Numerov
method.

What makes the method different from standard methods is that the coefficients
are recalculated in each step. This increases the computational cost; however,
the increase is minimized by the parallel implementation proposed in this
paper.

We observe that once the grid has been selected, the coefficients $\alpha_j$, $d_1$
and $d_2$ can be calculated independently at each point $x_n$. Thus, at the
beginning of the program we compute these coefficients in parallel for all the
points $x_n$ of the grid. In the same way, we compute all the values of the
potential at the grid points at the beginning of the program. We call this the
initialization phase. The following diagram explains the idea (see Table 1).
Table 1. Diagram for the initialization. Order of the method $k$, total number
of steps in the integration $m\,n$.

  Processor 1                              ...   Processor m
  $d_1, d_2$                               ...   $d_1, d_2$
  $\alpha_1 \ldots \alpha_k$               ...   $\alpha_1 \ldots \alpha_k$
  at the points $x_1 \ldots x_n$           ...   at the points $x_{(n-1)m+1} \ldots x_{nm}$
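The block decomposition of the initialization phase can be sketched in Python as follows. This is only an illustration of the partitioning idea, not the authors' FORTRAN/MPI code: it uses a thread pool from the standard library, and only the potential values are precomputed, using a potential of the form $m(1/x^{12} - 1/x^6)$ with $m = 500$ as an example. The grid, the starting point and the names `blocks` and `init_phase` are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def lennard_jones(x, m=500.0):
    """Example potential U(x) = m (1/x^12 - 1/x^6); m = 500 is an illustrative choice."""
    return m * (1.0 / x**12 - 1.0 / x**6)


def blocks(n_points, n_workers):
    """Split the indices 0..n_points-1 into contiguous, nearly equal blocks."""
    base, extra = divmod(n_points, n_workers)
    out, start = [], 0
    for w in range(n_workers):
        size = base + (1 if w < extra else 0)
        out.append(range(start, start + size))
        start += size
    return out


def init_phase(grid, n_workers):
    """Precompute U at every grid point, one contiguous block per worker."""
    def work(indices):
        return [lennard_jones(grid[i]) for i in indices]

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = list(pool.map(work, blocks(len(grid), n_workers)))
    # Blocks come back in order, so concatenation restores the grid order.
    return [u for part in parts for u in part]


# Hypothetical grid: 296 points with step 0.05 starting at x = 1.
grid = [1.0 + 0.05 * i for i in range(296)]
potential = init_phase(grid, n_workers=4)
print(len(potential))
```

Because every grid point is handled independently, no communication is needed until the blocks are gathered. In CPython, threads mainly illustrate the decomposition; a process pool or MPI would be needed to obtain real speed-up for CPU-bound work.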



Table 2 shows the execution time of this initialization phase for a method of
order $k = 6$ and a total of 296 steps. We integrate a single equation and a
system of dimension 80 ($l = 0, \ldots, 79$). The results show that the speed-up is
close to the number of processors. For a scheme with constant coefficients, the
CPU time of the initialization is due only to the computation of the potential
on the grid.



Table 2.

  Num. of processors   T/CPU (1 Eq.)   Speed-up   T/CPU (80 Eq.)
  2                    0.0254 sec.     1.92       2.0 sec.
  3                    0.0183 sec.     2.54       1.5 sec.
  4                    0.0143 sec.     3.08       1.1 sec.
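The reported speed-ups can be converted into parallel efficiencies (speed-up divided by the number of processors). The short Python sketch below only post-processes the figures of Table 2; the names `reported_speedup` and `efficiency` are introduced for the example.

```python
# Speed-up values for the single-equation case, copied from Table 2.
reported_speedup = {2: 1.92, 3: 2.54, 4: 3.08}


def efficiency(speedup, processors):
    """Parallel efficiency: fraction of the ideal linear speed-up actually achieved."""
    return speedup / processors


for p, s in reported_speedup.items():
    print(p, round(efficiency(s, p), 2))
```

The efficiencies are roughly 0.96, 0.85 and 0.77 for 2, 3 and 4 processors, so the gain per added processor decreases as the communication share of the initialization grows.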



The following figures show a snapshot of the initialization of the parallel 
process. The green color represents computation time of each processor. The 
yellow/red color represents communication times. The green zone in the pro- 
cessor 0 represents the integration. It can be observed that the final speedup 
is roughly related to the ratio between the green and yellow areas during the 
initialization phase. 

We would like to point out in this section that if the equation we are inte- 
grating needs an explicit method for its computation, a predictor and a corrector 
method can be obtained using the following recurrences 



Vn+A + dl (Xn+4 , , 2/l)?/^+2 + (Xn+4 , Xn+2 ,“2h)y^ 

Vn+3 + dl {Xn+3 , Xn+1 , h)vl+2 + dl {Xn+3 , Xn+2 , h)yl^^ 



(19) 
Fig. 1. Parallel execution snapshot with 2 processors




Fig. 2. Parallel execution snapshot with 4 processors 
These recurrences and the procedure described in Section 2 allow us to
write the methods in the form

Vn+4 + dl{Xn+4, Xn, 2/l)2/“+2 + dl{Xn+4, Xn+2, = 

k 

= h^aofl+z + X] yl+i-i) 

(20) 

Vl+Z + dl{Xn+3, Xn+l,h)y^_^2 + dl(Xn+3, Xn+ 2 , = 

k 

= h'^Pofn+3 + PifiXn+l-i, y^+l-i), 
i=l 

where the coefficients $\alpha_i$ and $\beta_i$ are the solution of a system of equations
similar to (12). The implementation in this case is similar to the one proposed
in [9].

4 Numerical Examples 

In order to test the accuracy of the proposed procedure we apply it to the solution
of equation (3), using as $U(x)$ the Lennard-Jones potential, which has been widely
discussed in the literature. For this problem the potential has been taken as in
Simos and Raptis,

$$ U(x) = m\left(\frac{1}{x^{12}} - \frac{1}{x^{6}}\right) \tag{21} $$

where $m = 500$.

The problem considered is the computation of the relevant phase shifts. We
initialize the integration with the popular Numerov method using a small step.
The steps taken by the Numerov method are not taken into account in the results
presented.

We consider (following, for example, T. Simos) the asymptotic form of the
solution

$$ y(x) \approx A\,kx\,j_l(kx) - B\,kx\,n_l(kx) \simeq AC\left(\sin\left(kx - \frac{l\pi}{2}\right) + \tan\delta_l \cos\left(kx - \frac{l\pi}{2}\right)\right) \tag{22} $$

where $\delta_l$ is the phase shift, which may be calculated from the formula

$$ \tan\delta_l = \frac{y(x_2)\,S(x_1) - y(x_1)\,S(x_2)}{y(x_1)\,C(x_2) - y(x_2)\,C(x_1)} \tag{23} $$

for $x_1$ and $x_2$ distinct points in the asymptotic region. We take as asymptotic
region $x > 15$, with $x_1 = 15$ and $x_2 = 15 - h$, $h$ being the step size. Here
$S(x) = kx\,j_l(kx)$ and $C(x) = -kx\,n_l(kx)$.
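The two-point phase-shift formula can be checked with a small Python sketch for $l = 0$. It assumes the sign convention $C(x) = -kx\,n_l(kx)$, under which the quotient returns $\tan\delta_l$, so that $S(x) = \sin(kx)$ and $C(x) = \cos(kx)$ for $l = 0$. The function name `phase_shift_l0`, the wave number and the test solution are assumptions made for the example.

```python
import math


def phase_shift_l0(y1, y2, x1, x2, k):
    """Phase shift for l = 0 from the solution values y1 = y(x1), y2 = y(x2).

    Uses S(x) = sin(k x) and C(x) = cos(k x), i.e. the l = 0 forms of
    k x j_0(k x) and -k x n_0(k x).
    """
    S1, S2 = math.sin(k * x1), math.sin(k * x2)
    C1, C2 = math.cos(k * x1), math.cos(k * x2)
    tan_delta = (y2 * S1 - y1 * S2) / (y1 * C2 - y2 * C1)
    return math.atan(tan_delta)


# Synthetic asymptotic solution with a known phase shift (illustrative values).
k, delta, h = 1.3, 0.4, 0.05
x1, x2 = 15.0, 15.0 - h
y = lambda x: math.sin(k * x + delta)
print(phase_shift_l0(y(x1), y(x2), x1, x2, k))
```

For the synthetic solution $\sin(kx + \delta)$ the formula recovers $\delta$ up to rounding error, since numerator and denominator reduce to $\sin\delta\,\sin(k(x_1 - x_2))$ and $\cos\delta\,\sin(k(x_1 - x_2))$.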

Since the problem is treated as an initial-value problem, one needs $y_0$ and
$y_1$ before starting the Numerov method. As mentioned, $y_0 = 0$, and
following [4,1] the solution behaves as a constant times $x^{l+1}$ as $x \to 0$.
Accordingly we take $y_1 = h^{l+1}$.
In the next table we have chosen $k = 5$ and report the error of the proposed
method, of order 6, with respect to the true phase shift. The results can be
compared with those obtained by Simos [5].

Table 3. $k = 5$. Accuracy in phase shift. Order 6. Number of steps 292, $h = 0.05$

  l     Phase shift (True)   Phase shift (Computed)   Error
  0     -0.4831              -0.4832                  1 x 10^-4
  1      0.9282               0.9277                  5 x 10^-4
  2     -0.9637              -0.9639                  2 x 10^-4
  3      0.1206               0.1170                  36 x 10^-4
  4      1.0328               1.0349                  21 x 10^-4
  5     -1.3785              -1.3779                  6 x 10^-4
  6     -0.8441              -0.843                   8 x 10^-4
  7     -0.5244              -0.5256                  12 x 10^-4
  8     -0.4575              -0.4575
  9     -0.7571              -0.7571
  10     1.4148               1.4148



All computations were carried out on a Silicon Graphics Origin 200 server
with four MIPS R10000 processors and the MPI library LAM 6.3 [2]. On this
architecture communication is a write/read operation on the shared memory.
We used FORTRAN and double precision arithmetic with 16 digits of accuracy.

Conclusion: As we can see, computing all the coefficients in each step costs
only a few seconds, even when working with large systems of ODEs, while it
yields a significant improvement in precision. In the opinion of the authors,
the effectiveness of the method proposed in this work has been demonstrated,
since parallel machines with at least a few processors are nowadays quite
commonly available.
Chapter 3:

Linear and Non-linear Algebra 



Introduction 



The majority of the papers in this chapter are concerned with linear and
non-linear algebra. The invited talk by Mark Stadtherr, Parallel Branch-and-Bound
for Chemical Engineering Applications: Load Balancing and Scheduling Issues,
revisits the branch-and-bound technique and adapts it to parallel computing.

Castro and Frangioni also discuss the solution-search problem, but in the
context of linear programming; the block-angular structure of the coefficient
matrix associated with an interior-point algorithm is exploited for parallel
iterative solution.

Toeplitz matrices, encountered in many applications, are often solved by 
techniques based on a generalized Schur algorithm. Alonso et al. present a par- 
allel algorithm for solving the Toeplitz least square problem that maintains the 
stability properties of the sequential method and shows good efficiency with a 
reduced number of processors. 

Schram discusses the application of concepts based on structured infinite 
index-domain to multi-grid algorithms for the solution of boundary value prob- 
lems. 

Solution of tridiagonal systems by parallel decoupling is the subject of the
paper by Amor et al.; high parallel efficiency, over 91%, is shown for
implementations on a distributed memory machine.

Peinado and Vidal focus on the solution of the inverse eigenproblem for real 
symmetric Toeplitz matrices; a parallel algorithm based on Newton-like methods, 
implemented on a shared memory architecture and a cluster of 20 PCs, shows a 
speed-up close to unity. 

Forjaz and Ralha evaluate a parallel implementation of the zeroNr method
for the symmetric tridiagonal eigenvalue problem; their results show that a
multiple pipeline topology, 7 pipelines with 16 processors each, reduces the
communications overhead, although without retaining the load balancing
characteristics of the single pipeline made up of 112 processors, with
efficiency above 95% for all cases tested.

Arnal et al. discuss the results of non-stationary parallel Newton iterative 
methods on two platforms; they report efficiencies of 90% and 60% with 2 and 
4 processors. 
The paper by Castillo et al. discusses the results of block-partitioned and
parallel algorithms used for solving the so-called pole assignment problem for
single-input systems, of relevance in the design of linear control systems.

Speed of computation is crucial in real-time control applications, reducing
the importance of code portability and supporting the development of software
applications for specific computer architectures. The study by Martinez et al.
is one such case; they discuss the design and implementation of a systolic
library, an algorithm for solving the generalized Sylvester equation; i.e.,
reusable systolic arrays that can be implemented in reconfigurable architectures
based on FPGA devices.




Parallel Branch-and-Bound for Chemical 
Engineering Applications: Load Balancing 
and Scheduling Issues 
Invited Talk 



Chao-Yang Gau and Mark A. Stadtherr*

Department of Chemical Engineering, 182 Fitzpatrick Hall,
University of Notre Dame, Notre Dame, IN 46556, USA
markst@nd.edu



Abstract. Branch-and-prune (BP) and branch-and-bound (BB) techniques
are commonly used for intelligent search in finding all solutions,
or the optimal solution, within a space of interest. The corresponding
binary tree structure provides a natural parallelism allowing concurrent
evaluation of subproblems using parallel computing technology. Of special
interest here are techniques derived from interval analysis, in particular
an interval-Newton/generalized-bisection procedure. In this context,
we discuss issues of load balancing and work scheduling that arise in the
implementation of parallel BB and BP, and describe and analyze techniques
for this purpose. These techniques are applied to solve problems
appearing in chemical process engineering using a distributed parallel
computing system. Results show that a consistently high efficiency can
be achieved in solving nonlinear equations, providing excellent scalability.
The effectiveness of the approach used is also demonstrated in the
consistent superlinear speedup observed in performing global optimization.



1 Introduction 

The continuing success of the chemical and petroleum processing industries de- 
pends on the ability to design and operate complex, highly interconnected plants 
that are profitable and that meet quality, safety, environmental and other stan- 
dards. Towards this goal, process modeling, simulation and optimization tools 
are increasingly being used industrially in every step of the design process and in 
subsequent plant operations. To perform realistic and reliable process simulation 
and optimization for industrial scale processes, however, requires very large scale 
computational resources. Parallel computing technology offers the potential to 
provide the necessary computational power. However, since most currently used 
problem solving techniques in process modeling and optimization were devel- 
oped for use on conventional serial machines, it is often necessary to rethink 
problem solving strategies in order to take full advantage of parallel computing
technology.

In this context, we are particularly interested in the use of parallel computing 
technology to address reliability issues that arise in solving process engineering 
problems. The models that must be solved in process simulation problems are 
typically highly nonlinear and may have multiple solutions. The goal is to find 
all solutions, to ensure that the solution or solutions of interest are not missed.
Similarly, in optimization problems, the nonlinear programming problems to 
be solved are typically nonconvex, and there may be several local optima. The 
goal is to find the global optimum, though in some problems finding all of the 
local optima may be of interest as well. The approach we apply involves the 
use of interval analysis, combined with branch-and-prune (BP) or branch-and- 
bound (BB) strategies. Properly implemented, such techniques can find, or more 
precisely enclose, all solutions to a system of nonlinear equations, and can be 
used to enclose the global optimum, or all local optima, in optimization problems. 
This can be done with mathematical and computational certainty. 

Since the subproblems (tree nodes) generated in the tessellation step in BB 
and BP algorithms are independent, these techniques are particularly amenable 
to parallel processing. In this paper, we focus specifically on issues of load balanc- 
ing and scheduling that arise in the implementation of parallel BB and BP, and 
describe and analyze techniques for this purpose. An application to a problem 
arising in chemical process engineering is used to demonstrate the effectiveness
of the approach used. 

2 Distributed Parallel Computing 

The solution of realistic, industrial-scale simulation and optimization problems 
is computationally very intense, and requires the use of adequate computational 
resources to be done in a timely manner. High performance computing (HPC) 
technology, in particular parallel computing, provides the computational power 
to realistically model, simulate, design and optimize complex chemical
manufacturing processes. To better use these leading edge technologies in process
simulation requires the use of techniques that efficiently exploit parallel
computational resources. One of the major trends in this regard is the use of
distributed computing systems. Typically, in this sort of system, memory is
physically distributed, and communication may be done by message passing through
some interconnection network.

The use of parallel processing in chemical engineering has attracted signifi- 
cant attention over the past decade or so. There are a variety of applications for 
which a distributed approach to parallel computing has proven to be effective. In 
chemical process systems engineering, examples that involve either actual
implementation on distributed systems, or algorithms appropriate for distributed
computing, can be seen in the fields of deterministic global optimization and
reliable nonlinear equation solving (e.g., [1,2,3,4,5,6,7,8,9]), nondeterministic
global optimization (e.g., [10,11,12]), BB in process scheduling (e.g.,
[13,14,15,16]), BB in process synthesis (e.g., [10,17,18,19]), and process
simulation, analysis and optimization (e.g.,
[20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39]).
There are also a number of important application areas outside of process
systems engineering (e.g., [40,41,42,43,44,45]).

The type of distributed parallel system of particular interest here is a cluster 
of workstations (COW), in which multiple workstations on a network are used 
as a single parallel computing resource. This sort of parallel computing system 
has advantages since it is relatively inexpensive, and is based on widely
available hardware. Thus, such an approach to parallel computing has become an
important trend in providing high performance computing resources in science
and engineering.



3 Branch-and-Bound 

Branch-and-prune (BP) and branch-and-bound (BB) algorithms are general- 
purpose intelligent search techniques for finding all solutions, or the optimal 
solution, within a space of interest, and have a wide range of applications. These 
techniques employ successive decomposition (tessellation) of the global problem
into smaller disjoint or independent subproblems that are solved recursively until 
all solutions, or the optimal solution, are found. BB and BP have important 
applications in engineering and science, especially when a global solution to an 
optimization problem, or all solutions to a nonlinear equation solving problem 
are sought. In chemical engineering, these applications include process synthesis 
(e.g., [10,17,18,19]), process scheduling (e.g., [13,14,15,16]), analysis of phase 
behavior (e.g., [46,47,48]), and molecular modeling (e.g., [49]). 

In BP, a subproblem is typically processed in some way to verify the existence 
of a feasible solution. The subproblem may be examined by a series of tests, and 
is pruned when it fails specified criteria or if a unique solution can be found inside
this subdomain. If no conclusion is available, and so the subproblem cannot be
pruned, the problem is bisected into two additional subproblems (nodes),
generating a binary tree structure. One of the subproblems is then put in a
stack and tests are continued on the other. This type of BP procedure is one
of the basic ideas underlying the application of interval analysis to
equation-solving problems. More details on interval analysis, in particular the
interval-Newton/generalized-bisection (IN/GB) method, are presented in the next section.
When solving a system of nonlinear equations, the pruning scheme consists of 
a function range test and the interval-Newton existence and uniqueness test. 
There are three situations in which an interval (node) can be pruned: (1) there 
is some component of the function range that does not contain zero; (2) a unique 
solution is proven to be enclosed, and (3) it is proven that no solutions exist. 
With these pruning criteria, a scheme can be constructed that searches the entire 
binary tree and finds all solutions of the equation system. 
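The stack-based BP search just described can be sketched in a few lines of Python for a single equation. This is a simplified illustration under stated assumptions: only the function range test is used (no interval-Newton uniqueness test), interval bounds are computed in ordinary floating point without outward rounding, and the example function $f(x) = x^2 - 2$ and the names `sq_minus_2` and `branch_and_prune` are chosen for the example.

```python
def sq_minus_2(lo, hi):
    """Interval extension of f(x) = x*x - 2 on [lo, hi] (exact range, no rounding control)."""
    ends = (lo * lo, hi * hi)
    low = 0.0 if lo <= 0.0 <= hi else min(ends)
    return low - 2.0, max(ends) - 2.0


def branch_and_prune(F, lo, hi, tol=1e-6):
    """Return small boxes that may contain a root of f, where F bounds the range of f."""
    stack, found = [(lo, hi)], []
    while stack:
        a, b = stack.pop()
        f_lo, f_hi = F(a, b)
        if f_lo > 0.0 or f_hi < 0.0:
            continue  # range test: zero is not in F([a, b]), prune this node
        if b - a < tol:
            found.append((a, b))  # narrow enough: keep as a candidate enclosure
            continue
        mid = 0.5 * (a + b)
        stack.append((a, mid))  # bisect: both halves go on the stack,
        stack.append((mid, b))  # the one pushed last is examined first
    return sorted(found)


boxes = branch_and_prune(sq_minus_2, -3.0, 3.0)
print(boxes)
```

For this example two narrow boxes survive, one around each of the roots $\pm\sqrt{2}$. Without a uniqueness test the surviving boxes are only candidates, and a root lying exactly on a bisection point would be reported in two adjacent boxes.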

In BB, the goal is typically to find a globally optimal solution to some
problem. BB may be built on top of BP schemes by embedding an additional pruning
test. In this test, a node is pruned when its optimal (lower bounding) solution is
guaranteed to be worse (greater) than some known current best value (an upper 
bound on the global minimum). Thus, one avoids visiting subproblems which 
are known not to contain the globally optimal solution. In this context, various 
heuristic schemes may be of considerable importance in maintaining search effi- 
ciency. For example, when solving global minimization problems using interval 
analysis, the best upper bound value may be generated and updated by some 
heuristic combination of an interval extension of the objective function, a point 
objective function evaluation with interval arithmetic, and a local minimization 
with a verification by interval analysis. In order to enhance bounding and prun- 
ing efficiency, some approaches also apply a priority list scheme in BB. Typically, 
all problems in the stack are rearranged in the order of some importance index, 
such as a lower bound value. The idea is that the most important subproblems 
stored in the stack are examined with higher priority, in the hope that the global 
optimum be found early in the search process, thus allowing other later subprob- 
lems that do not possess the global optimum to be quickly pruned before they 
generate new nodes. 
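A priority-list BB of this kind can be sketched in Python with a binary heap ordered by lower bound. The sketch is illustrative only: the objective $\phi(x) = x^2 - 2x$, its natural interval lower bound, the midpoint rule for updating the upper bound and the names `phi`, `phi_lower` and `branch_and_bound` are assumptions for the example, and no outward rounding is applied.

```python
import heapq


def phi(x):
    """Example objective; its global minimum is -1 at x = 1."""
    return x * x - 2.0 * x


def phi_lower(lo, hi):
    """Lower bound of phi on [lo, hi] from the natural interval extension X*X - 2*X."""
    sq_lo = 0.0 if lo <= 0.0 <= hi else min(lo * lo, hi * hi)
    return sq_lo - 2.0 * hi


def branch_and_bound(lo, hi, tol=1e-6):
    """Return an upper bound on the global minimum of phi over [lo, hi]."""
    best = phi(0.5 * (lo + hi))            # current best (upper bound)
    heap = [(phi_lower(lo, hi), lo, hi)]   # priority list keyed by lower bound
    while heap:
        lb, a, b = heapq.heappop(heap)
        if lb > best:
            continue  # node cannot contain the global minimum: prune
        mid = 0.5 * (a + b)
        best = min(best, phi(mid))         # update the upper bound by a point evaluation
        if b - a < tol:
            continue
        for c, d in ((a, mid), (mid, b)):
            child_lb = phi_lower(c, d)
            if child_lb <= best:
                heapq.heappush(heap, (child_lb, c, d))
    return best


print(branch_and_bound(-3.0, 3.0))
```

Because the node with the smallest lower bound is always examined first, the upper bound improves early and most of the tree far from $x = 1$ is pruned before it is expanded.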

In BB or BP search, the shape and size of the search space typically changes 
as the search proceeds. Portions that contain a solution might be highly ex- 
panded with many nodes and branches, while portions that have no solutions 
might be discarded immediately, thus resulting in an irregularly structured 
search tree. It is only through actual program execution that it becomes ap- 
parent how much work is associated with individual subproblems and thus what 
the actual structure of the search tree is. Since the subproblems to be solved 
are independent, execution of both BP and BB on parallel computing systems 
can clearly provide improvements in computational efficiency; thus the use of 
parallel computing to implement BP and BB has attracted significant attention 
(e.g., [50,51,52,53,54,55,56]). However, because of the irregular structure of the 
binary tree, this implementation on distributed systems is often not straight- 
forward. Details concerning the methodology for implementing BP and BB on 
distributed parallel systems will be discussed in later sections. 



4 Interval Analysis 

Of particular interest here are BP and BB schemes based on interval analysis. A
real interval $Z$ is defined as the set of real numbers lying between (and including)
given upper and lower bounds; i.e., $Z = [z^L, z^U] = \{z \in \Re \mid z^L \le z \le z^U\}$. A
real interval vector $\mathbf{Z} = (Z_1, Z_2, \ldots, Z_n)^T$ has $n$ real interval components and
can be interpreted geometrically as an n-dimensional rectangle (box). Note that 
in this section lower case quantities are real numbers and upper case quantities 
are intervals. Several good introductions to interval analysis are available (e.g., 
[57,58,59]). In this section, interval analysis is described in the context of solving 
nonlinear parameter estimation problems, since that is the primary example used 
in the tests discussed later. However, it should be emphasized that the interval 
methods discussed here are general-purpose and can be used in connection with 
other objective functions in a global optimization problem and other equation
systems in an equation solving problem. 

BP and BB techniques can be constructed using the interval-Newton technique.
Given a nonlinear equation system with a finite number of real roots in
some initial interval, this technique provides the capability to find (or, more
precisely, narrowly enclose) all the roots of the system within the given initial
interval. For the unconstrained minimization of an objective function (or
estimator) $\phi(\boldsymbol{\theta})$ in parameter estimation, a common approach is to use the
gradient of $\phi(\boldsymbol{\theta})$ and seek a solution of $\mathbf{g}(\boldsymbol{\theta}) = \nabla\phi(\boldsymbol{\theta}) = \mathbf{0}$ in order to
determine the optimal parameter values $\boldsymbol{\theta}$. The global minimum will be a root of
this nonlinear equation system, but there may be many other roots as well,
representing local minima and maxima and saddle points. Thus, for this approach
to be reliable, the capability to find all the roots of $\mathbf{g}(\boldsymbol{\theta}) = \mathbf{0}$ is needed, and
this is provided by the interval-Newton technique. In practice, by using an
objective range test, as discussed below, the interval-Newton procedure can also
be implemented as a BB technique, so that roots of $\mathbf{g}(\boldsymbol{\theta}) = \mathbf{0}$ that cannot be the
global minimum need not be found. The solution algorithm is applied to a sequence
of intervals, beginning with some initial interval $\boldsymbol{\Theta}^{(0)}$ specified by the user.
This initial interval can be chosen to be sufficiently large to enclose all
physically feasible values. It is assumed here that the global optimum will occur
at an interior stationary minimum of $\phi(\boldsymbol{\theta})$ and not at the boundaries of
$\boldsymbol{\Theta}^{(0)}$. Since the estimator $\phi(\boldsymbol{\theta})$ is derived based on a product of Gaussian
distribution functions corresponding to each data point, only a stationary global
minimum is reasonable for statistical regression problems such as considered here.

For an interval $\boldsymbol{\Theta}^{(k)}$ in the sequence, the first step in the solution algorithm
is the function range test. Here an interval extension $\mathbf{G}(\boldsymbol{\Theta}^{(k)})$ of the function
$\mathbf{g}(\boldsymbol{\theta})$ is calculated. An interval extension provides upper and lower bounds on the
range of values that a function may have in a given interval. It is often computed
by substituting the given interval into the function and then evaluating the
function using interval arithmetic. The interval extension so determined is often
wider than the actual range of function values, but it always includes the actual
range. If there is any component of the interval extension $\mathbf{G}(\boldsymbol{\Theta}^{(k)})$ that does
not contain zero, then we may discard (prune) the current interval (node) $\boldsymbol{\Theta}^{(k)}$,
since the range of the function does not include zero anywhere in this interval,
and thus no solution of $\mathbf{g}(\boldsymbol{\theta}) = \mathbf{0}$ exists in this interval. We may then proceed
to consider the next interval in the sequence, since the current interval cannot
contain a stationary point of $\phi(\boldsymbol{\theta})$. Otherwise, if $\mathbf{0} \in \mathbf{G}(\boldsymbol{\Theta}^{(k)})$, then testing of
$\boldsymbol{\Theta}^{(k)}$ continues.
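The overestimation inherent in a natural interval extension, and its use in the function range test, can be illustrated with a minimal Python sketch. The `Interval` class, the example function $g(x) = x(1 - x)$ and the intervals used are assumptions made for the example, and no outward rounding is applied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def __sub__(self, other):
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        products = (self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi)
        return Interval(min(products), max(products))

    def contains_zero(self):
        return self.lo <= 0.0 <= self.hi


def G(X):
    """Natural interval extension of g(x) = x * (1 - x)."""
    one = Interval(1.0, 1.0)
    return X * (one - X)


print(G(Interval(0.0, 1.0)))                  # wider than the true range [0, 0.25]
print(G(Interval(2.0, 3.0)).contains_zero())  # range test: this node can be pruned
```

On $[0, 1]$ the extension returns $[0, 1]$ although the true range of $g$ is $[0, 0.25]$: the two occurrences of $x$ are treated as independent. On $[2, 3]$ it returns $[-6, -2]$, which excludes zero, so the interval cannot contain a root of $g$ and is discarded.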

The next step is the objective range test. The interval extension $\Phi(\boldsymbol{\Theta}^{(k)})$,
which contains the range of $\phi(\boldsymbol{\theta})$ over $\boldsymbol{\Theta}^{(k)}$, is computed. If the lower bound of
$\Phi(\boldsymbol{\Theta}^{(k)})$ is greater than a known upper bound on the global minimum of $\phi(\boldsymbol{\theta})$,
then $\boldsymbol{\Theta}^{(k)}$ cannot contain the global minimum and need not be further tested.
Otherwise, testing of $\boldsymbol{\Theta}^{(k)}$ continues. The upper bound on the objective function
used for comparison in this step can be determined and updated in a number
of different ways. Here we use point evaluations of $\phi(\boldsymbol{\theta})$ done at the midpoint
of previously tested $\boldsymbol{\Theta}$ intervals that may contain stationary points. Using the
objective range test yields a BB procedure for the global minimization of $\phi(\boldsymbol{\theta})$,
while if this step is skipped, we will have a BP technique for finding all solutions
of $\mathbf{g}(\boldsymbol{\theta}) = \mathbf{0}$, i.e., all stationary points of $\phi(\boldsymbol{\theta})$.

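The next step is the interval-Newton test. Here the linear interval equation
system

$$ G'(\boldsymbol{\Theta}^{(k)})\,(\mathbf{N}^{(k)} - \boldsymbol{\theta}^{(k)}) = -\mathbf{g}(\boldsymbol{\theta}^{(k)}) $$

is set up and solved for a new interval $\mathbf{N}^{(k)}$, where $G'(\boldsymbol{\Theta}^{(k)})$ is an interval
extension of the Jacobian of $\mathbf{g}(\boldsymbol{\theta})$, i.e., the Hessian of $\phi(\boldsymbol{\theta})$, over the current
interval $\boldsymbol{\Theta}^{(k)}$, and $\boldsymbol{\theta}^{(k)}$ is a point in the interior of $\boldsymbol{\Theta}^{(k)}$, usually taken to be the
midpoint. It has been shown (e.g., [57,58,59]) that any root $\boldsymbol{\theta}^* \in \boldsymbol{\Theta}^{(k)}$ of $\mathbf{g}(\boldsymbol{\theta}) = \mathbf{0}$
is also contained in the image $\mathbf{N}^{(k)}$, implying that if there is no intersection
between $\boldsymbol{\Theta}^{(k)}$ and $\mathbf{N}^{(k)}$ then no root exists in $\boldsymbol{\Theta}^{(k)}$, and suggesting the iteration
scheme $\boldsymbol{\Theta}^{(k+1)} = \boldsymbol{\Theta}^{(k)} \cap \mathbf{N}^{(k)}$. In addition to this iteration step, which can be
used to tightly enclose a solution, it has been proven (e.g., [57,58,59]) that if
$\mathbf{N}^{(k)}$ is contained completely within $\boldsymbol{\Theta}^{(k)}$, then there is one and only one root
contained within the current interval $\boldsymbol{\Theta}^{(k)}$. This property is quite powerful, as
it provides a mathematical guarantee of the existence and uniqueness of a root
within an interval when it is satisfied.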

There are thus three possible outcomes to the interval-Newton test, as shown
schematically for a two-variable problem in Figs. 1-3. The first possible outcome
(Fig. 1) is that $N^{(k)} \subset \Theta^{(k)}$. This represents mathematical proof that there exists
a unique solution to $g(\theta) = 0$ within the current interval $\Theta^{(k)}$, and that that
solution also lies within the image $N^{(k)}$. This solution can be rigorously enclosed,
with quadratic convergence, by applying the interval-Newton step to the image
and repeating a small number of times. Alternatively, convergence to a point
approximation of the solution can be guaranteed using a routine point-Newton
method starting from anywhere inside of the current interval. Since a unique
solution has been identified for this subproblem, it can be pruned, and the next
interval in the sequence can now be tested, beginning with the function range
test.

The second possible outcome (Fig. 2) is that $N^{(k)} \cap \Theta^{(k)} = \emptyset$. This provides
mathematical proof that no solutions of $g(\theta) = 0$ exist within the current
interval. Thus, the current interval can be pruned and testing of the next interval can
begin.

The final possible outcome (Fig. 3) is that the image $N^{(k)}$ lies partially within
the current interval $\Theta^{(k)}$. In this case, no conclusions can be made about the
number of solutions in the current interval. However, it is known that any
solutions that do exist must lie in the intersection $\Theta^{(k)} \cap N^{(k)}$. If the intersection
is sufficiently smaller than the current interval, one can proceed by reapplying
the interval-Newton test to the intersection. Otherwise, the intersection is
bisected, and the resulting two intervals added to the sequence of intervals to be
tested. This approach is referred to as an interval-Newton/generalized-bisection
(IN/GB) method, and depending on whether or not the objective range test is
employed, can be interpreted as either a BB or BP procedure.

Fig. 1. The computed image $N^{(k)}$ is a subset of the current interval $\Theta^{(k)}$. This is
mathematical proof that there is a unique solution of the equation system in the
current interval, and furthermore that this unique solution is also in the image.

Fig. 2. The computed image $N^{(k)}$ has a null intersection with the current interval
$\Theta^{(k)}$. This is mathematical proof that there is no solution of the equation system
in the current interval.

Fig. 3. The computed image $N^{(k)}$ has a nonnull intersection with the current
interval $\Theta^{(k)}$. Any solutions of the equation system must lie in the intersection
of the image and the current interval.
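
The three outcomes of the interval-Newton test can be illustrated with a minimal one-variable sketch in Python. This is not the implementation used in this work. It assumes ordinary floating point arithmetic instead of outward-rounded interval arithmetic, it assumes the caller supplies an enclosure of the derivative over the box (the hypothetical argument `dg_enclosure`), and it simply signals bisection when that enclosure contains zero, a case that would otherwise need extended interval division. The function name and the outcome labels are illustrative only.

```python
def interval_newton_test(g, dg_enclosure, box):
    """Classify a 1-D box (lo, hi) with one interval-Newton step.

    g            -- point function g(theta)
    dg_enclosure -- returns (d_lo, d_hi) enclosing g'(theta) over a box
    Returns (outcome, new_box).
    """
    lo, hi = box
    mid = 0.5 * (lo + hi)                 # expansion point: the midpoint
    d_lo, d_hi = dg_enclosure(box)
    if d_lo <= 0.0 <= d_hi:
        # Simplification of this sketch: no extended interval division.
        return "bisect", box
    gm = g(mid)
    # Image N = mid - g(mid) / [d_lo, d_hi]
    q = (gm / d_lo, gm / d_hi)
    n_lo, n_hi = mid - max(q), mid - min(q)
    if n_hi < lo or n_lo > hi:
        return "no_root", None            # empty intersection (Fig. 2)
    if lo <= n_lo and n_hi <= hi:
        return "unique_root", (n_lo, n_hi)  # image inside the box (Fig. 1)
    # Partial overlap (Fig. 3): continue with the intersection
    return "undetermined", (max(lo, n_lo), min(hi, n_hi))
```

For example, with g(x) = x*x - 2 and the derivative enclosure (2*lo, 2*hi), the box (1, 2) is classified as containing a unique root, while the box (2, 3) is pruned.
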



It should be emphasized that, when machine computations with interval
arithmetic operations are done, the endpoints of an interval are computed with
a directed outward rounding. That is, the lower endpoint is rounded down to
the next machine-representable number and the upper endpoint is rounded up to
the next machine-representable number. In this way, through the use of interval,
as opposed to floating point, arithmetic, any potential rounding error problems
are eliminated, yielding an approach that can provide a computational, not just
mathematical, guarantee of reliability. Overall, when properly implemented, the
IN/GB method described above provides a procedure that is mathematically and
computationally guaranteed to find the global minimum of $\phi(\theta)$ or, if desired,
to enclose all of its stationary points (within, of course, the specified initial
parameter interval).
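
The idea of outward rounding can be sketched for interval addition as follows. This is an illustrative approximation, not the arithmetic package used here: it assumes Python 3.9 or later for `math.nextafter`, and instead of switching the hardware rounding mode it always moves each computed endpoint one representable number outward, which is slightly more conservative than true directed rounding. The name `add_outward` is hypothetical.

```python
import math

def add_outward(a, b):
    """Add intervals a = (lo, hi) and b = (lo, hi) with outward rounding."""
    # Push the lower endpoint down and the upper endpoint up by one
    # machine-representable number, so the exact sum is always enclosed.
    lo = math.nextafter(a[0] + b[0], -math.inf)
    hi = math.nextafter(a[1] + b[1], math.inf)
    return (lo, hi)
```

The returned interval strictly contains the floating point sum of the endpoints, so a rounding error in either endpoint cannot cause a true value to be lost.
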

In implementing an IN/GB algorithm, there are opportunities for the use 
of parallel computing at multiple levels. On a fine-grained level, the basic in- 
terval arithmetic operations can be parallelized (e.g., [60]). On a larger-grained 
level, the solution of the linear interval equation system for the image can be 
parallelized (e.g., [61,62,63,64]). Of course, on a coarse-grained level, each inde- 
pendent subproblem generated in the bisection process can be tested in parallel 
(e.g., [1,6,8,9]). It is only this coarsest level of parallelism that will be considered
here. 

5 Dynamic Load Balancing and Work Scheduling 

As noted above, since the subproblems to be solved are independent, the execution
of interval-Newton techniques, whether BP or BB, on distributed parallel
systems can clearly provide improvements in computational efficiency. And since,
for practical problems, the binary tree that needs to be searched may be quite 
large, there may in fact be a strong motivation for trying to exploit the oppor- 
tunity for parallel computing. However, because of the irregular structure of the 
binary tree, doing this may not be straightforward. 

While executing a program to assign the unprocessed workload (stack boxes) 
to available processors, the irregularity of the tree could cause a highly uneven 
distribution of work among processors and result in poor utilization of comput- 
ing resources. Newly generated boxes at some tree nodes, due to bisection, could 
cause some processors to become highly loaded while others, if processing tree 
nodes that can be pruned, could become idle or lightly loaded. In this context, we 
need an effective dynamic load balancing and work scheduling scheme to perform 
the parallel tree search efficiently. To manage the load balancing problem, one 
seeks to apply an optimal work scheduling strategy to transfer workload (boxes 
to be tested) automatically from heavily loaded processors to lightly loaded pro- 
cessors or processors approaching an idle state. The primary goal of dynamic 
load balancing algorithms is to schedule workload among processors during pro- 
gram execution, to prevent the appearance of idle processors, while minimizing 
interprocessor communication cost and thus maximizing the utilization of the 
computing resources. 

A common load balancing strategy is the “manager-worker” scheme (e.g.,
[3,4,7,12,19]), in which a single “manager” processor centrally conducts a group 
of “worker” processors to perform a task concurrently. This scheme has been 
popular in part because it is relatively easy to implement. It amounts to using a 
centralized pool to buffer workloads among processors. However, as the number 
of processors becomes large, such a centralized scheme could result in a signif- 
icant communication overhead expense, as well as contention on the manager 
processor. As a result, in many cases, especially in the context of a tightly cou- 
pled cluster, this scheme does not exhibit particularly good scalability. Thus, 
to avoid bottlenecks and high communication overhead, we concentrate here 
on decentralized schemes (without a global stack manager), and consider three 
types of load balancing algorithms specifically designed for network-based par- 
allel computing using message passing. It should be noted, however, that for 
loosely coupled systems, the manager-worker scheme can be quite effective, as 
demonstrated, for example, in the metaNEOS project [65,66]. 

These parallel algorithms adopt a distributed strategy that allows each pro- 
cessor to locally make workload placement decisions. This strategy helps a pro- 
cessor maintain for itself a moderate local workload stack, hopefully prevent- 
ing itself from becoming idle, and alleviates bottleneck effects when applied on 
large-scale multicomputers. All distributed parallel algorithms of this type are 
basically composed of five phases: workload measurement, state information ex- 
change, transfer initiation, workload placement, and global termination. Each of 
these phases is now discussed in more detail.

5.1 Workload Measurement

As the first stage in a dynamic load balancing operation, workload measurement 
involves evaluation of the current local workload using some “work index”. This
is a criterion that needs to be calculated frequently, and so it must be inexpensive 
to determine. It also needs to be sufficiently precise for purposes of making good 
workload placement decisions later. In the context of interval BP and BB, a good 
approach is to simply use the stack length (number of boxes) as the work index. 
This index is effective in parallel BP and BB schemes because of the following
characteristics: 

- A long stack represents a heavy workload and vice versa.

- An empty stack indicates that the local processor is approaching an
idle state.

- A precise representation of workload by the work index may not be needed, since
it may not be necessary to maintain an equal workload on all processors, but
merely to prevent the appearance of idle states.

Thus, the stack length can serve as a simple, yet effective, workload index. 



5.2 State Information Exchange 

After all processors identify their own workload state, the parallel algorithm 
makes this local information available to all other cooperating processors, through 
interprocessor message passing, to construct a global work index vector. The co- 
operating processors are a group of processors participating in load balancing 
operations with a local processor, and define the domain of interprocessor com- 
munication, thereby determining a virtual network for cooperation. The range of 
this domain is critical in determining the cost of communication and the performance
of load balancing. One possibility is that the cooperating processors could
include all processors available on the network, and a global all-to-all communication
scheme could then be used to update global state information. This provides
a very up-to-date global work index vector but might come at the expense of 
high communication overhead. Alternatively, the cooperating processors might 
include only a small subset of the available processors, with this small subset 
defining a local processor’s nearest “neighbors” in the virtual network. Now 
one needs only to employ cheap local point-to-point communication operations. 
However, without a well-tailored and nested virtual network, and a good load 
balancing algorithm, these local schemes could result in workload imbalance and 
idle states. 
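
The trade-off between the two choices of cooperating processors can be made concrete with a small sketch. It assumes, purely for illustration, that the local domain is a ring in which each processor has two neighbors, and that every processor sends one point-to-point message carrying its work index to each cooperating processor. Both function names are hypothetical.

```python
def ring_neighbors(rank, size):
    """Left and right neighbors of processor `rank` on a ring of `size`."""
    return (rank - 1) % size, (rank + 1) % size

def messages_per_exchange(size, scheme):
    """Point-to-point messages needed for one state information exchange."""
    if scheme == "all-to-all":
        return size * (size - 1)   # everyone informs everyone else
    if scheme == "ring":
        return 2 * size            # everyone informs two neighbors
    raise ValueError("unknown scheme: " + scheme)
```

Under these assumptions, 16 processors exchange 240 messages per update in the all-to-all case and 32 in the ring case, which shows why the range of the cooperation domain matters as the number of processors grows.
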

5.3 Transfer Initiation 

After obtaining an overview of the workload state, at least for the group of 
cooperating (“neighboring”) processors, load balancing algorithms now need to 
decide if a workload placement is necessary to maintain balance and prevent
an idle state. This is done according to an initiation policy which dictates under
what conditions a workload (box) transfer is initiated, and decides which
processors will trigger the load balancing operation. Generally, the migration of 
boxes from one processor to another processor is initiated on demand. In this 
context, the load balancing operations are event driven according to different 
procedures, such as a sender-initiate scheme (e.g., [67,68,69]), a receiver-initiate 
scheme (e.g., [70,71,72]) and a symmetric scheme (e.g., [2,73,74]). In the sender- 
initiate scheme, when the workload of any processor is too heavy and exceeds an 
upper threshold, the overloaded processor will offload some of its stack boxes to 
another processor through the network. The receiver-initiate approach works in 
the opposite way by having an underloaded processor request boxes from heav- 
ily loaded processors, when the underloaded processor’s workload is less than 
a lower threshold. The symmetric scheme combines the previous two strategies 
and allows both underloaded and overloaded processors to initiate load balancing 
operations. 
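
The three initiation policies can be summarized in a short sketch. The function name, the string labels, and the threshold arguments `low` and `high` are illustrative assumptions, not values or interfaces taken from the cited schemes.

```python
def should_initiate(stack_len, low, high, scheme):
    """Decide whether a processor triggers a load balancing operation.

    Returns "send" (offload boxes), "request" (ask for boxes) or None.
    """
    # Sender-initiate: an overloaded processor offloads work.
    if scheme in ("sender", "symmetric") and stack_len > high:
        return "send"
    # Receiver-initiate: an underloaded processor asks for work.
    if scheme in ("receiver", "symmetric") and stack_len < low:
        return "request"
    return None
```

With `low=2` and `high=10`, a processor holding 12 boxes initiates a transfer under the sender-initiate and symmetric schemes but stays passive under the receiver-initiate scheme.
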

5.4 Workload Placement 

The next step of the load balancing algorithm is to complete a workload placement.
Here the donor processor splits the local stack into two parts, sending one part
to the requesting processor and retaining the other. This operation is done according
to a transfer policy consisting of two rules: a work-adjusting rule and
a work-selection rule. The work-adjusting rule determines how to distribute
workload among processors and how many stack boxes are to be transferred.
If the requesting processor receives too little work, it may quickly become idle; 
if the donor processor offloads too much work, it itself could also become idle. 
In either case, the result would eventually intensify the communication needed 
to perform later load balancing operations. Many approaches are available for 
this rule. One simple approach is to transfer a constant number of work units 
(boxes) upon receiving a request, such as in a work stealing strategy (e.g., [75]). 
A more sophisticated approach is to adopt a diffusive propagation strategy (e.g., 
[76,77,78]), which takes into account the workload states on both sides and ad- 
justs the workload dynamically with a mechanism analogous to heat or mass 
diffusion. 

In addition to the quantity of workload, as measured by the work index, 
the “quality” of transferred boxes is also an important issue. In this context, a 
work-selection rule is applied to select the most suitable boxes to transmit in 
order to supply adequate work to the requesting processor, and thus reduce the 
demands for further load balancing operations later. Although it is difficult to 
precisely estimate the size of the tree (or total work) rooted at an unexamined 
node (box), many heuristic rules have been proposed to select the appropriate 
boxes. One rule-of-thumb is to transmit boxes near the initial root of the overall 
binary tree, because these boxes tend to have more future work associated with 
the subsequent tree rooted at them (e.g., [79]). While this has been demonstrated 
to be a good selection rule in many tree search applications, this and other such 
selection rules will not necessarily have a strong influence on the performance of
a parallel BP algorithm applied to equation-solving problems using interval
analysis. However, the selection rule used can have a strong impact on a parallel 
BB algorithm when solving global minimization problems, since by affecting the 
evaluation sequence of boxes it in turn affects the time at which good upper 
bounds on the global minimum are identified. In general, the earlier a good 
upper bound on the global minimum can be found, the less work that needs 
to be done to complete the global minimization, since this means it is more 
likely that boxes can be pruned using an objective range test. This issue will be 
addressed in more detail in a later section. 
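
The effect of the upper bound on pruning can be illustrated with a minimal sketch of the objective range test. It assumes each box already carries a lower bound on the objective over that box (obtained, for instance, from an interval extension); the function name and the data layout are hypothetical.

```python
def objective_range_test(boxes, best_upper):
    """Keep only boxes that may still contain the global minimum.

    boxes      -- list of (label, lower_bound_of_objective_over_box)
    best_upper -- best known upper bound on the global minimum
    """
    # A box whose lower bound exceeds the upper bound cannot contain
    # the global minimum and is pruned.
    return [(label, lower) for label, lower in boxes if lower <= best_upper]
```

With lower bounds 1, 3 and 5 on three boxes, an upper bound of 4 prunes one box, while a better upper bound of 2 prunes two, which is why finding a good upper bound early reduces the total work.
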

5.5 Global Termination 

Parallel computation will be terminated when the globally optimal solution for 
BB problems, or all feasible solutions for BP problems, have been found over the 
entire binary tree, making all processors idle. For a synchronous parallel algo- 
rithm, global termination can be easily detected through global communication 
or periodic state information exchange. However, detecting the global termina- 
tion stage is a more difficult task for an asynchronous distributed algorithm, not 
only because of the lack of global or centralized control, but also because there 
is a need to guarantee that upon termination no unexamined workload remains 
in the communication network due to message passing. One commonly used ap- 
proach that provides a reliable and robust solution to this problem is Dijkstra’s 
token termination detection algorithm [53,80,81]. 
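
The token idea can be sketched, in a strongly simplified sequential form, as one probe around a ring of processors. The sketch assumes a static snapshot in which no messages are in flight, so it does not address the in-transit workload issue mentioned above, and it abandons a probe at a busy processor where a real token would wait. The names `token_probe`, `idle` and `black` are illustrative.

```python
def token_probe(idle, black):
    """Run one probe of a Dijkstra-style token ring started by processor 0.

    idle[i]  -- True if processor i has no work left
    black[i] -- True if processor i sent work since it last forwarded the token
    Returns True if global termination may be declared.
    """
    n = len(idle)
    if not idle[0]:
        return False              # processor 0 only starts a probe when idle
    black[0] = False              # initiator whitens itself, sends a white token
    token_black = False
    for i in range(n - 1, 0, -1):     # token path: 0 -> n-1 -> ... -> 1 -> 0
        if not idle[i]:
            return False          # a real token would wait here
        if black[i]:
            token_black = True    # a black processor taints the token
        black[i] = False          # a processor whitens itself on forwarding
    return not token_black        # a white token back at 0 means termination
```

If some processor sent work since the previous probe, the first probe is inconclusive even when all processors are idle, and a second, clean probe is needed before termination is declared.
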

6 Implementation of Dynamic Load Balancing Algorithms

In this section, a sequence of three algorithms is described for load balancing in 
a binary tree, with each algorithm in the sequence representing an improvement 
in principle over the previous one. The last method represents a combination of 
the most attractive and effective strategies adapted from previous research stud- 
ies, and also incorporates some novel strategies in this context. Interprocessor 
communication is performed using the MPI protocol [82,83], a very powerful and
popular technique for message passing operations that provides various communication
functions as discussed below. In the subsequent section, the performance
of the three algorithms described will be compared. 

6.1 Synchronous Work Stealing (SWS) 

This first workload balancing algorithm applies a global strategy, and is illus- 
trated in Fig. 4. All processors are synchronized in the interleaving computation 
and communication phases. Synchronous blocking all-to-all communication is 
used to periodically (after some number of tests on boxes) update the global 
workload state information. Then, every idle processor, if there are any, “steals” 
one unit of workload (one box) from the processor with the heaviest workload
(the largest number of stack boxes), applying a receiver-initiate scheme. As the
responsibility for the workload placement decision is given to each individual
processor, rather than to a centrally controlling manager processor, but global
communication is maintained, SWS can be regarded as a type of distributed
manager/worker scheme.

Fig. 4. The SWS algorithm uses global all-to-all communication to synchronize
computation and communication phases.
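
The stealing step of SWS can be sketched as a sequential simulation of one synchronization point. Several details are assumptions made only for this illustration: the stacks are plain Python lists, the stolen box is taken from the bottom of the donor stack, a donor is never left empty, and the workload state is refreshed after each steal instead of being taken from a single snapshot.

```python
def sws_steal_round(stacks):
    """One SWS balancing step; `stacks` holds one list of boxes per processor.

    Mutates `stacks` in place and returns the new stack lengths.
    """
    for stack in stacks:
        if stack:
            continue                      # only idle processors steal
        # Processor with the heaviest workload (largest number of boxes)
        donor = max(range(len(stacks)), key=lambda j: len(stacks[j]))
        if len(stacks[donor]) > 1:        # assumption: never empty the donor
            stack.append(stacks[donor].pop(0))   # steal one box
    return [len(s) for s in stacks]
```

Starting from stack lengths 0, 3, 1 and 0, one step leaves every processor with exactly one box.
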



The global, all-to-all communication used in this approach provides for an 
easy determination of workload dynamics, and may lead to a good global load 
balancing. However, like the centralized manager/worker scheme, this convenience
also comes at the expense of increased communication cost when using
many processors. Such costs may result in intolerable communication overhead
and degradation of overall performance (speedup). It should also be noted that
the synchronous and blocking properties of the communication scheme may cause 
idle states in addition to those that might arise due to an out-of-work condition. 
When using the synchronous scheme, a processor (sender) that has reached the 
synchronization point and is ready for communication needs to stay idle and 
wait for another processor (receiver) to reach the same status, and then initiate 
the communication together. Additional waiting states may occur due to the use 
of blocking communication, since a message-passing operation may not complete 
and return control to the sending processor until the data has been moved to 
the receiving processor and a receive posted. Thus, the main difficulties with the 
SWS approach are the communication overhead and the likely occurrence of idle 
states, which together may result in poor scalability. However, one advantage to
this approach is that the global communication makes it easy to detect global
termination.

6.2 Synchronous Diffusive Load Balancing (SDLB)

This second approach for workload balancing follows a localized strategy, by 
using local, point-to-point communication and a local cooperation strategy in 
which load balancing operations are limited to a local domain of cooperating pro- 
cessors, i.e., a group of “nearest neighbors” on some predefined virtual network. 
A diffusive work-adjusting rule is also applied here to dynamically coordinate 
workload transmission between processors, thereby achieving a workload balance 
with a mechanism analogous to heat or mass diffusion, as illustrated in Fig. 5.

Fig. 5. SDLB uses a diffusive work-adjusting scheme to share workload among
neighbors in the virtual network. It is synchronous like SWS.



Instead of using global communication, point-to-point synchronous blocking 
communication is used to exchange workload state information among cooper- 
ating (neighbor) processors. The gathered information allows a given processor 
to construct its own work index vector indicating the workload distribution in 
its neighborhood. Then, the algorithm uses a symmetric initiation scheme to 
cause the workload (boxes) to “diffuse” from processors with relatively heavy 
workloads to processors with relatively light workloads, in order to maintain a 
roughly equivalent workload over all processors. The virtual network used ini- 
tially here is simply a ring, which gives each processor two nearest neighbors. 
Each local processor, i, adjusts its local workload with a neighbor, j, according 
to the rule 

$$u(j) = C\,[\mathrm{workflg}(i) - \mathrm{workflg}(j)],$$

where $u(j)$ is the workload-adjusting index, $C$ is a “diffusion coefficient” and
workflg is the work index vector. If $u(j)$ is positive and/or greater than a
threshold, the local processor sends out workload (boxes); if $u(j)$ is negative
and/or less than a threshold, the local processor receives workload (boxes). The
diffusion coefficient, $C$, is a heuristic parameter determining what fraction of
local work to offload, and is set at 0.5 in our applications. This diffusive scheme
has two advantages. First, when applied at an appropriate frequency, it pro- 
vides some certainty in preventing the appearance of out-of-work idle states. 
Also, compacting multiple units of workload (boxes) together for transmission 
enlarges the virtual grain of the transmitted messages. The use of coarse-grained 
messages to reduce communication frequency tends to minimize the effect of high 
latency in network transmission, especially on Ethernet. For example, less total 
time is wasted in startup time of transmission, thus lowering the average trans- 
mission cost of a work unit (box), as well as the ratio of communication time to 
computation time. It should be noted that in considering message grain there 
may also be maximum message size considerations. 
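
The diffusive work-adjusting rule can be illustrated with a sequential simulation of one synchronous step on a ring. The sketch works on the work index vector only (stack lengths, not actual boxes). The send threshold of one box and the truncation of the index to a whole number of boxes are assumptions made for the illustration, as is the function name.

```python
def sdlb_sweep(workflg, C=0.5, threshold=1.0):
    """One synchronous diffusion step on a ring.

    workflg -- work index vector (stack length of each processor)
    Returns the work index vector after all transfers.
    """
    n = len(workflg)
    new = list(workflg)
    for i in range(n):
        # A set avoids counting the same neighbor twice on tiny rings.
        for j in {(i - 1) % n, (i + 1) % n}:
            u = C * (workflg[i] - workflg[j])   # workload-adjusting index
            if u >= threshold:
                k = int(u)                      # whole boxes to send
                new[i] -= k
                new[j] += k
    return new
```

For the work index vector [8, 2, 4, 6] the step yields [4, 6, 4, 6]: the total workload is conserved, and several boxes travel together in each message, as discussed above.
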

Though the use of a local communication scheme will reduce communication
cost to some extent, the continued use of synchronous and blocking communication
operations still makes it difficult to achieve good scalability. On the other
hand, while using local rather than global communication makes the detection of 
global termination less efficient, the synchronous and blocking properties make 
this relatively straightforward. Since the problem of detecting global termina- 
tion becomes more difficult as the number of processors grows, this is another 
important issue in scalability. 

6.3 Asynchronous Diffusive Load Balancing (ADLB) 

In this third load balancing approach, a local communication strategy and dif- 
fusive work-adjusting scheme are used, as in SDLB. However, a major difference 
here is the use of an asynchronous nonblocking communication scheme, one of 
the key capabilities of MPI. The combination of asynchronous communication 
functionality and nonblocking, persistent communication functionality not only 
provides for cheaper communication operations by eliminating communication 
idle states, but also, by breaking process synchronization, makes the sequence of 
events in the load balancing scheme flexible by allowing overlap of communication
and computation. As illustrated in Fig. 6, when each processor can perform
communication arbitrarily at any time, and independently of a cooperating processor,
all communication operations can be scattered among computation, with
less time consumed in message passing.

In addition to the cheaper and more flexible communication scheme, we in- 
corporate into the ADLB approach two new strategies to try to reduce the 
demand for communication and thereby try to achieve a higher overall perfor- 
mance. First, as noted above, in BP and BB methods, it is not really necessary 
to maintain a completely balanced workload across processors. The actual goal 
is to prevent the occurrence of idle states by simply maintaining a workload on
each processor sufficiently large to keep it busy with computation. To achieve
balanced workloads may require a very large number of workload transmissions, 
resulting in a heavy communication burden. However, in this case, many of the
workload transmissions may be unnecessary, since in BP and BB each processor
deals with its stack one work unit (box) at a time sequentially, leaving all other
workload simply standing by. For a processor to avoid an idle state, and thus
have a high efficiency in computation, it is not necessary that its workload be
balanced with other processors, but only that it be able to obtain additional
workload from another processor through communication as it is approaching
an out-of-work state. Thus, we use here a receiver-initiate scheme to initiate
work transfer only when the number of boxes in a processor’s stack is lower than
some threshold, which should be set high enough that the processor is not likely
to complete the work and become idle during the processing of the workload
request to its neighboring processors.

Fig. 6. ADLB uses an asynchronous, nonblocking communication scheme, providing
more flexibility to each processor and overlapping communication and
computation phases.

As a consequence, we can also implement a second strategy, which eliminates 
the periodic state information exchange and combines the load state information 
of the requesting processor with the workload request message to the donor pro- 
cessor. Upon receiving the request, the donor follows a diffusive work-adjusting 
scheme as described above for the SDLB approach, but with a modification in 
the response to the workload-adjusting index. Here, if $u(j)$ is positive and/or
greater than a threshold, the donor sends out workload (boxes) to the requesting
processor; otherwise, it responds that there is no extra workload available. Thus,
when approaching idle, a processor sends out a request for work to all its cooperating
neighbors, and waits for any processor’s donation of work. If no
work is transferred, the neighbor processors are also starved
for work and are making work requests to other neighbors. In this case, the processor
will keep requesting work from the same neighbors until they eventually
obtain extra work from remote processors and are able to donate parts of it.
Through such a diffusive mechanism, heavily loaded processors can propagate 
workload to lightly loaded processors with a small communication expense. 
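
The donor side of this request and response exchange can be sketched as follows. The sketch leaves out the asynchronous MPI communication entirely and shows only the decision logic. The request is assumed to carry the stack length of the requesting processor, the donated boxes are assumed to come from the bottom of the donor stack, and the function name and the threshold of one box are hypothetical.

```python
def handle_request(donor_stack, requester_len, C=0.5, threshold=1.0):
    """Donor response to a work request carrying the requester's stack length.

    Returns the list of donated boxes; an empty list stands for the reply
    that no extra workload is available. Mutates `donor_stack`.
    """
    u = C * (len(donor_stack) - requester_len)   # workload-adjusting index
    if u < threshold:
        return []                                # nothing to spare
    k = int(u)                                   # whole boxes to donate
    donated = donor_stack[:k]
    del donor_stack[:k]
    return donated
```

A donor holding ten boxes gives five of them to an idle requester, while a donor holding a single box declines, so the requester has to ask again later.
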

The last step of this load balancing procedure is to detect global termination. 
Because the ADLB scheme is asynchronous, the detection of global termination 
is a more complex issue than in the synchronous case. As noted above, a popular 
and effective technique for dealing with this issue is Dijkstra’s token algorithm 
[53,80,81]. This is the technique used in the ADLB scheme. 

In the next section, we describe tests of the three approaches outlined above 
for load balancing in parallel BP and BB. 

7 Computational Experiments and Results 

7.1 Test Environment 

The performance of an algorithm on a parallel computing system is not only 
dependent on the problem characteristics and the number of processors but also 
on how processors interact with each other, as determined both by a physical 
architecture in hardware and a virtual architecture in software. The physical ar- 
chitecture used in these tests, as illustrated in Fig. 7, is a network-based system, 
comprising 16 Sun Ultra 1/140e workstations, physically connected by switched
Ethernet. As noted above, in comparison to mainframe systems, such a cluster 
of workstations (COW) has advantages in its relatively low expense and easy 
availability of hardware. However, depending on the communication bandwidth 
and on the communication demands of the algorithm being executed, network 
contention can have a serious impact on the performance of such a system, par- 
ticularly if the number of processors is large. 




Fig. 7. Physical hardware used is a cluster of workstations connected by switched
Ethernet.



Two types of virtual network are used: an all-to-all network (Fig. 8(a)) in 
the case of SWS, and a one-dimensional torus (ring) network (Fig. 8(b)) in the 
cases of SDLB and ADLB. In the SWS algorithm, the all-to-all network is im- 
plemented by the use of global, all-to-all communication. However, in the SDLB 
and ADLB algorithms, in order to reduce communication demands and alleviate 
potential network contention, we only use point-to-point local communication
functions and implement the ring network. The load balancing algorithms and
test problems were implemented in FORTRAN-77 using the MPI protocol [82,83] 
for interprocessor communication. In particular, we used the popular LAM (Lo- 
cal Area Multicomputer) implementation of MPI, developed and distributed by 
the Laboratory for Scientific Computing at the University of Notre Dame. 





Fig. 8. Virtual network in load balancing: (a) all-to-all network using global
communication; used for SWS; (b) 1-D torus network using local communication;
used for SDLB and ADLB.



7.2 Test Problem 

The test problem used is a global nonlinear parameter estimation problem involv- 
ing a vapor-liquid equilibrium (VLE) model (Wilson’s equation). Such models, 
and the estimation of parameters in them, are important in chemical process 
engineering, since they are the basis for the design, simulation and optimization 
of widely-used separation processes such as distillation [48]. In this particular 
problem, we use as the objective function the maximum likelihood estimator, 
with two unknown standard deviations, to determine two model parameters giv- 
ing the globally optimal fit of the data to the model [84]. In addition to the 
difficult nonlinear objective function, the problem data and characteristics were 
chosen to make this a particularly difficult problem, requiring a few hours of 
computation time on a single processor. Interval analysis, as described above, 
is used to guarantee the correct global solution. The problem can be solved in 
either of two ways. One approach is to treat it as a nonlinear equation solving 
problem, and use the parallel interval BP algorithm to solve for all stationary
points of the objective function (there are five stationary points in this problem).
The alternative approach is to treat it directly as a global optimization problem 
and use the parallel interval BB algorithm. The major difference between the 
two approaches is the use of the objective range test in the BB algorithm. 

7.3 Computational Results 

This parameter estimation problem was solved using the COW system described 
above. During the computational experiments, the COW was dedicated exclu- 
sively to solving this problem; that is, there were no other users either on the 
workstations or on the network. Both the BP scheme solving for all stationary 
points and the BB scheme merely searching for the global optimum were ex- 
ecuted on up to 16 processors using each of the three load balancing schemes 
described above. Both sequential and parallel execution times were measured 
in terms of the MPI wall time function, and the performance of each approach 
evaluated in terms of parallel speedup (ratio of the sequential execution time to 
the parallel execution time) and parallel efficiency (ratio of the parallel speedup 
to the number of processors used). 
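
The two performance measures defined above are simple ratios and can be written as follows. The timing values in the usage note are hypothetical and are not results from these experiments.

```python
def speedup(t_seq, t_par):
    """Parallel speedup: sequential time divided by parallel time."""
    return t_seq / t_par

def efficiency(t_seq, t_par, n_proc):
    """Parallel efficiency: speedup divided by the number of processors."""
    return speedup(t_seq, t_par) / n_proc
```

For instance, a hypothetical run taking 152 time units sequentially and 10 time units on 16 processors has a speedup of 15.2 and an efficiency of 0.95.
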

For the interval BP problem of finding all stationary points, the speedups 
obtained using the three load balancing algorithms, i.e. SWS, SDLB and ADLB, 
on various numbers of processors are shown in Fig. 9. All five stationary points
were found in every experiment. All points in Fig. 9 are based on an average 
over several runs. Since both the sequential runs and all parallel BP runs ex- 
plored the same binary tree and treated an equivalent amount of total work, 
the computational results are repeatable and consistent with negligible devia- 
tions. As expected, the ADLB approach clearly outperforms SWS and SDLB, 
exhibiting only slightly sublinear speedup. This can also be seen in the parallel 
efficiency curves, as shown in Fig. 10. While efficiency curves tend to decrease as 
the number of processors increases, as a consequence of Amdahl’s law, the
ADLB procedure maintains a high efficiency of around 95%. Thus, with the only 
slightly sublinear speedup and the very high efficiency on up to 16 processors, it 
seems likely that the ADLB algorithm will be highly scalable to larger numbers 
of processors. 

SWS exhibits the poorest performance of the three load balancing methods. 
This is partly due to a poor global workload distribution, resulting in a rela- 
tively large number of out-of-work idle states, and also partly due to the com- 
munication overhead from using the global synchronous blocking communication 
scheme. In SDLB, the symmetric diffusive work-adjusting scheme using the local 
communication scheme substantially reduces out-of-work idle states by achiev- 
ing an even load balance and thus improving the speedup and efficiency. How- 
ever, while a local communication scheme is employed, the synchronous blocking 
communication functions used retain a high communication cost and represent a 
scaling bottleneck. This issue is addressed in ADLB by using asynchronous non- 
blocking communication functions, allowing the overlap of communication and 
computation. In addition, by working towards a goal of maintaining non-empty 
local work stacks instead of an evenly balanced global workload distribution, 
ADLB provides a large reduction in network communication requirements, thus 
greatly reducing communication bottlenecks. The reduction of such bottlenecks 
in ADLB allows it to achieve a consistently high, nearly linear speedup. 

Fig. 9. Comparison of load balancing algorithms on equation solving problem: 
speedup vs. number of processors. 

Fig. 10. Comparison of load balancing algorithms on equation solving problem: 
efficiency vs. number of processors. 

For solving the parameter estimation problem as a global optimization prob- 
lem with parallel interval BB, only the best load balancing scheme, ADLB, was 
employed. Three different runs using the same problem were made at two, four, 
eight and 16 processors. The resulting speedups are shown in Fig. 11. We first 
observe that all speedups are above the linear speedup line, with a speedup 
over 50 on 16 processors in one case. Superlinear speedup is possible because 
of the broadcast of least upper bounds, which may cause tree nodes (boxes) to 
be discarded earlier than in the sequential case, i.e. there is less work to do in 
the parallel case than in the sequential case. Also, the speedups are not exactly 
repeatable and may vary significantly from run to run. This occurs because of 
slightly different timing in finding and broadcasting improved upper bounds in 
each run. Speedup anomalies, such as the superlinear speedups seen here, are not 
uncommon in parallel BB search, provided the reduction in the work required 
in the parallel case (which often happens but not always) is not outweighed by 
communication expenses or other overhead in the parallel computation. 




Fig. 11. Speedup anomaly and superlinear speedups are observed in solving the 
global optimization problem using the parallel BB algorithm based on ADLB. 

8 Discussion 

The excellent performance of ADLB on the tests described above provides mo- 
tivation for further improving the ADLB approach for execution on even larger 
numbers of processors and applied to different sizes of problems. One factor we 
have investigated is the effect of the underlying virtual network, which is defined 
to locally coordinate neighbor processors in workload distribution and message 
propagation. Instead of using a 1-D torus (ring) virtual network, a two dimen- 
sional (2-D) torus virtual network, as shown in Fig. 12, has been considered to 
enhance the load balancing performance. When compared to the 1-D torus, a 
2-D torus has a higher communication overhead due to each processor having 
more neighbors, but it also has a smaller network diameter, $2\lfloor \sqrt{P}/2 \rfloor$ vs. $\lfloor P/2 \rfloor$, 
thus decreasing the message diffusion distance. It is expected that the trade-off 
between communication overhead and message diffusion distance may favor the 
2-D torus for a larger number of processors. 
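
As a rough illustration of this trade-off, the sketch below computes the diameter and the neighbor count of both virtual networks. It assumes that P is the number of processors, that the 2-D torus is square (so P must be a perfect square), and that diameters follow the standard formulas for a ring and a square torus; the function names are hypothetical.

```python
import math


def ring_diameter(p: int) -> int:
    """Diameter of a 1-D torus (ring) of p processors: floor(p / 2)."""
    if p < 1:
        raise ValueError("p must be at least 1")
    return p // 2


def torus2d_diameter(p: int) -> int:
    """Diameter of a square 2-D torus of p processors: 2 * floor(sqrt(p) / 2)."""
    side = math.isqrt(p)
    if p < 1 or side * side != p:
        raise ValueError("this sketch assumes p is a positive perfect square")
    return 2 * (side // 2)


# Neighbors per processor (for networks large enough that they are distinct).
RING_NEIGHBORS = 2
TORUS2D_NEIGHBORS = 4
```

For 16 processors the ring has diameter 8 and the square torus diameter 4; for 64 processors the values are 32 and 8, so the gap in message diffusion distance widens as the network grows while the neighbor count stays fixed.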




Fig. 12. 2-D torus virtual network is implemented in ADLB to achieve high 
scalability when running over larger numbers of processors. 

To evaluate broadly the performance of different parallel algorithms, it is 
useful to carry out a scalability analysis, which examines how well an algorithm 
maintains a constant efficiency as the problem size and the number of processors 
increase. Thus, we carried out an experiment based on the isoefficiency function 
[53], which determines how much problem size needs to increase in proportion 
to the number of processors in order to keep the efficiency at a constant level. 
Small values of the isoefficiency function will correspond to better scalability. 
We have done preliminary experiments, performing isoefficiency analysis with 
up to 64 processors, which demonstrate the better scalability of the 2-D torus 
virtual network on parallel BB and BP problems. 

Another issue of interest in this context is how to improve the search efficiency 
of interval BB for the global optimum. As noted above, there are priority list 
schemes, such as prioritizing the stack based on a lower bound value, that have 
been demonstrated to be useful in a variety of branch and bound problems. A 
difficulty with using lower bound values is that these may not be sufficiently tight 
to provide any useful heuristic ordering for the evaluation of stack boxes. This 
is particularly true if the lower bound is obtained by simple interval arithmetic, 
which often provides only loose bounds when applied to a complicated function. 

Thus, we have developed another approach aimed at scheduling the stack 
boxes for processing. This is a novel dual stack management scheme in which 
each processor maintains two stacks, a global stack and a local stack. The local 
stack is unprioritized; that is, with workload appearing in the same sequence as 
it is generated in the IN/GB algorithm. The local processor draws its work from 
the local stack as long as it is not empty. This contributes a depth-first pattern 
to the overall tree search process. The global stack is also unprioritized, and 
is created by randomly removing boxes from the local stack. The global stack 
provides boxes for workload transmission to other processors. This contributes 
breadth to the tree search process. This dual stack management scheme has been 
demonstrated to be capable of producing consistently high superlinear speedups 
in BB, reducing the variations from run to run observed previously [85]. 
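
A minimal sketch of such a dual stack is given below. It is not the authors' implementation: the class and method names are hypothetical, boxes are treated as opaque objects, and the policy for when and how many boxes are moved to the global stack is left to the caller because the text does not specify it.

```python
import random


class DualStack:
    """Hypothetical per-processor dual stack for interval boxes."""

    def __init__(self, seed=None):
        self.local = []    # unprioritized: boxes kept in generation order
        self.global_ = []  # unprioritized: boxes offered to other processors
        self._rng = random.Random(seed)

    def push(self, box):
        """Store a newly generated box on the local stack."""
        self.local.append(box)

    def pop_work(self):
        """Take the most recent local box (depth-first), or None if empty."""
        return self.local.pop() if self.local else None

    def donate_to_global(self, count):
        """Move up to `count` randomly chosen boxes from local to global."""
        for _ in range(min(count, len(self.local))):
            idx = self._rng.randrange(len(self.local))
            self.global_.append(self.local.pop(idx))

    def take_for_export(self, count):
        """Remove up to `count` boxes from the global stack for sending."""
        n = min(count, len(self.global_))
        start = len(self.global_) - n
        out = self.global_[start:]
        del self.global_[start:]
        return out
```

Drawing local work from the top of the stack gives the depth-first behavior, while the random selection for the global stack spreads exported boxes across different levels of the search tree.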



9 Concluding Remarks 

We have described how load management strategies can be used for effectively 
solving interval BB and BP problems in parallel on a network-based system. Of 
the dynamic load balancing algorithms considered, the best performance was 
achieved by the asynchronous diffusive load balancing (ADLB) approach. This 
overlaps communication and computation by the use of the asynchronous non-blocking 
communication functions provided by MPI, and uses a type of diffusive 
load-adjusting scheme to prevent out-of-work idle states while keeping communication 
needs small. 

The ADLB algorithm was applied in connection with interval analysis, in 
particular with an interval-Newton/generalized bisection (IN/GB) procedure for 
reliable nonlinear equation solving and deterministic global optimization. IN/GB 
provides the capability to find (enclose) all solutions in a nonlinear equation solv- 
ing problem with mathematical and computational certainty, or the capability 
to solve global optimization problems with complete certainty. The results of 
applying ADLB in the equation solving context have shown that the parallel BP 
algorithm can achieve a nearly linear speedup with a consistently high efficiency 
around 95% on up to 16 processors in a one-dimensional torus virtual network. 
Preliminary indications are that ADLB provides high scalability up to 64 pro- 
cessors, and different sizes of problems, when using a 2-D torus virtual network. 
In the context of global optimization, the parallel BB algorithm achieves significantly 
superlinear speedups, though it is somewhat inconsistent in the extent to 
which this occurs. By implementing a new dual stack management scheme in 
connection with ADLB it appears that a consistently high superlinear speedup 
on optimization problems can be obtained. 

Though the test problem here was based on a global parameter estimation 
problem, it should be emphasized that the parallel IN/GB method is general- 
purpose and can be used in connection with a wide variety of global optimization 
problems and nonlinear equation solving problems. Also, the load management 
schemes described can be applied to a wide variety of other tree search problems 
in chemical process engineering, such as in process synthesis and process 
scheduling. 

Abstract. A parallel implementation of the specialized interior-point 
algorithm for multicommodity network flows introduced in [5] is pre- 
sented. In this algorithm, the positive definite systems of each iteration 
are solved through a scheme that combines direct factorization and a 
preconditioned conjugate gradient (PCG) method. Since the solution of 
at least k independent linear systems is required at each iteration of the 
PCG, k being the number of commodities, a coarse-grained parallelization 
of the algorithm naturally arises. Also, several other minor steps of 
the algorithm are easily parallelized by commodity. An extensive set of 
computational results on a shared memory machine is presented, using 
problems of up to 2.5 million variables and 260,000 constraints. The re- 
sults show that the approach is especially competitive on large, difficult 
multicommodity flow problems. 



1 Introduction 



Multicommodity flows are among the most challenging linear problems, due 
to the large size of these models in real world applications (e.g., routing in 
telecommunications networks). Indeed, these problems have been used to test 
the efficiency of early interior-point solvers for linear programming [1]. The need 
to solve very large instances has led to the development of both specialized 
algorithms and parallel implementations. 

In this paper, we present a parallel implementation of a specialized 
interior-point algorithm for multicommodity flows [5]. In this approach, the 
block-angular structure of the coefficient matrix is exploited for performing in 
parallel the solution of small linear systems related to the different 
commodities, unlike general-purpose parallel interior-point codes [2,8,18] where 
the parallelization effort is focused on the Cholesky factorization of one large 
system. This has already been proposed [17,10,14]; however, all the previous 
approaches require computing and factorizing the Schur complement. This can 
become a significant serial bottleneck, since this matrix is usually 
prohibitively dense. Although this bottleneck can be partly avoided by using 
parallel linear algebra routines, our approach takes a more radical route by not 
forming the Schur complement at all, and using an iterative method instead. 
There have been other proposals along these lines [23,15], 
but limited to the sequential case; also, so far no results have been shown for 
these algorithms. The implementation presented in this paper significantly im- 
proves on the preliminary one described in [6]. There, only some of the major 
routines were parallelized, and less attention was paid to communication and 
data distribution. Working on these details allowed us to obtain new and better 
computational results. 

From the multicommodity point of view, this approach is different from most 
other parallel solvers [7,16,20,26,22,13] in that it is not based on a decomposition 
approach. The structure of the multicommodity flow problem has led to a num- 
ber of specialized algorithms, most of which share the idea of decomposing in 
some way the problem into a set of smaller independent problems. These are 
all iterative methods, where at each step the subproblems are solved, and their 
results are used in some way to modify the subproblems to be solved at the next 
iteration. Hence, these approaches are naturally suited for coarse-grained paral- 
lelization. Parallel price-directive decomposition approaches have been proposed 
based on bundle methods [7,20], analytic center methods [13] or linear-quadratic 
penalty functions [22]. Parallel resource-directive approaches are described in 
[16]. Finally, experiences with a parallel interior-point decomposition method 
are presented in [26]. A discussion of these and other parallel decomposition 
approaches can be found in [7]. A general description of the parallelization of 
mathematical programming algorithms can be found in [3,24]. 

The paper is organized as follows. Section 2 presents the formulation of the 
problem to be solved. Section 3 outlines the specialized interior-point algorithm 
for multicommodity flows proposed in [5], including a brief description of the 
general path-following method. Section 4 deals with the parallelization issues of 
the algorithm. Finally, Section 5 presents and discusses the computational results. 



2 Problem Formulation 

The multicommodity flow problem requires finding the least-cost routing of a 
set of k commodities through a network of m nodes and n arcs, where the arcs 
have an individual capacity for each commodity, and a mutual capacity for all 
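the commodities. The node-arc formulation of the problem is 

$$
\begin{array}{rl}
\min & \displaystyle\sum_{i=1}^{k} c^i x^i \\[2mm]
\text{subject to} &
\begin{bmatrix}
E & 0 & \cdots & 0 & 0 \\
0 & E & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & E & 0 \\
I & I & \cdots & I & I
\end{bmatrix}
\begin{bmatrix} x^1 \\ x^2 \\ \vdots \\ x^k \\ x^0 \end{bmatrix}
=
\begin{bmatrix} b^1 \\ b^2 \\ \vdots \\ b^k \\ u \end{bmatrix} \\[2mm]
& 0 \le x^i \le u^i, \quad i = 1, \ldots, k, \qquad 0 \le x^0 \le u .
\end{array}
\qquad (1)
$$
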



Vectors $x^i \in \mathbb{R}^n$ are the flow arrays for each commodity, while $x^0 \in \mathbb{R}^n$ are the 
slacks of the mutual capacity constraints. $E \in \mathbb{R}^{m \times n}$ is the node-arc incidence 
matrix of the underlying directed graph, while $I$ denotes the $n \times n$ identity 
matrix. We shall assume that $E$ is a full row-rank matrix: this can always be 
guaranteed by removing any of the redundant node balance constraints. $c^i \in \mathbb{R}^n$ 
and $u^i \in \mathbb{R}^n$ are respectively the flow cost vector and the individual capacity 
vector for commodity $i$, while $u \in \mathbb{R}^n$ is the vector of the mutual capacities. 
Finally, $b^i \in \mathbb{R}^m$ is the vector of supplies/demands for commodity $i$ at the 
nodes of the network. 

The multicommodity flow problem is a linear program with $\bar m = km + n$ 
constraints and $\bar n = (k+1)n$ variables. In some real-world models, $k$ can be very 
large: for instance, in many telecommunication problems a commodity represents 
the flow of data/voice between two given nodes of the network, and therefore 
$k \approx m^2$. Thus, the resulting linear program can be huge even for graphs of moderate 
size. However, the coefficient matrix of the problem is highly structured: it has 
a block-staircase form, each block being a node-arc incidence matrix. Several 
methods have been proposed which exploit this structure; one is the specialized 
interior-point algorithm described in the next section. 
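
To make the structure concrete, the sketch below assembles the constraint matrix for a tiny instance. It assumes the usual layout of the node-arc formulation (k copies of the incidence matrix E on the diagonal, followed by a row block of k+1 identity matrices whose last block multiplies the mutual capacity slacks); the function name and the nested-list representation are illustrative choices.

```python
def build_constraint_matrix(E, k):
    """Assemble the multicommodity constraint matrix as nested lists.

    E is an m x n node-arc incidence matrix (list of rows); the result has
    k*m + n rows and (k+1)*n columns.
    """
    m, n = len(E), len(E[0])
    cols = (k + 1) * n
    A = []
    # k diagonal blocks: flow conservation constraints of each commodity.
    for i in range(k):
        for row in E:
            new_row = [0] * cols
            new_row[i * n:(i + 1) * n] = row
            A.append(new_row)
    # Linking rows [I ... I I]: mutual capacities, slack block placed last.
    for j in range(n):
        new_row = [0] * cols
        for block in range(k + 1):
            new_row[block * n + j] = 1
        A.append(new_row)
    return A
```

For a graph with three nodes (one redundant balance row removed, so m = 2), n = 3 arcs and k = 2 commodities, the matrix has 2·2 + 3 = 7 rows and 3·3 = 9 columns, which shows how quickly the linear program grows with k.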



3 A Specialized Interior-Point Algorithm 

In [5], a specialized interior-point algorithm for multicommodity flows has been 
presented and tested. This algorithm, and the code that implements it, will be 
referred to as IPM. 

IPM is a specialization of the path-following algorithm for linear program- 
ming [27]. Let us consider the following linear programming problem in primal 
form 

$$\min \{\, cx : Ax = b,\ x + s = u,\ x, s \ge 0 \,\}, \qquad (2)$$

where $x \in \mathbb{R}^n$ and $s \in \mathbb{R}^n$ are respectively the primal variables and the slacks 
of the box constraints, $u \in \mathbb{R}^n$, $c \in \mathbb{R}^n$ and $b \in \mathbb{R}^m$ are respectively the upper 
bounds, the cost vector and the right hand side vector, and $A \in \mathbb{R}^{m \times n}$ is a full 
row-rank matrix. The dual of (2) is 

$$\max \{\, yb - wu : yA + z - w = c,\ z, w \ge 0 \,\}, \qquad (3)$$

where $y \in \mathbb{R}^m$, $z \in \mathbb{R}^n$ and $w \in \mathbb{R}^n$ are respectively the dual variables of the 
structural constraints $Ax = b$, the dual slacks and the dual variables of the box 
constraints $x \le u$. 

Replacing the inequalities in (2) by a logarithmic barrier in the objective 
function, with parameter $\mu$, the KKT optimality conditions of the resulting 
problem are 

$$
\begin{array}{rcl}
r_{xz} & = & \mu e - XZe = 0 \\
r_{sw} & = & \mu e - SWe = 0 \\
r_b & = & b - Ax = 0 \\
r_c & = & c - (yA + z - w) = 0 \\
r_u & = & u - x - s = 0 \\
& & (x, s, z, w) \ge 0 ,
\end{array}
\qquad (4)
$$

where $e$ is the vector of 1’s of proper dimension, and each uppercase letter 
corresponds to the diagonal matrix having as diagonal elements the entries of 
the corresponding lowercase vector. In the algorithm we impose $r_u = 0$, i.e. 
$s = u - x$, thus eliminating $n$ variables. 

The (unique) solutions of (4) for each possible $\mu > 0$ describe a continuous 
trajectory, known as the central path, which, as $\mu$ tends to 0, converges to the 
optimal solutions of (2) and (3). A path-following algorithm attempts to get 
close to these optimal solutions by following the central path. This is done by 
performing a damped version of Newton’s iteration applied to the nonlinear 
system (4), as shown in (5). The steplengths $\alpha_P$ and $\alpha_D$ are the maximum 
allowable values in the range (0, 1] such that the new iterate will keep on being 
strictly positive (note that when $\alpha_P = \alpha_D = 1$ the algorithm performs a pure 
Newton iteration). A more detailed description of the algorithm can be found in 
many linear programming textbooks, e.g. [27]. 
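
The steplength rule can be sketched as a ratio test. The damping factor below (0.995) is an assumption commonly used in interior-point codes to stay strictly inside the positive orthant; it is not taken from the text, and the function name is hypothetical.

```python
def max_step(values, directions, damping=0.995):
    """Return a steplength in (0, 1] keeping values + alpha * directions > 0.

    `values` must be strictly positive. Only components with a negative
    direction can limit the step.
    """
    alpha = 1.0
    for v, d in zip(values, directions):
        if d < 0:
            # An undamped step along d would reach zero at alpha = -v / d.
            alpha = min(alpha, damping * (-v / d))
    return alpha
```

The same routine would be called once with the primal quantities (x and s with their directions) and once with the dual ones (z and w), giving the primal and dual steplengths respectively.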



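Procedure PathFollowing($A$, $b$, $c$, $u$): 

    Initialize $x > 0$, $s > 0$, $y$, $z > 0$, $w > 0$; 
    while $(x, s, y, z, w)$ is not optimum do 
        $\Theta = (X^{-1} Z + S^{-1} W)^{-1}$; 
        $r = S^{-1} r_{sw} + r_c - X^{-1} r_{xz}$; 
        $(A \Theta A^T)\, \Delta y = r_b + A \Theta r$; 
        $\Delta x = \Theta (A^T \Delta y - r)$; 
        $\Delta w = S^{-1} (r_{sw} + W \Delta x)$; 
        $\Delta z = r_c + \Delta w - A^T \Delta y$; 
        Compute $\alpha_P \in (0, 1]$, $\alpha_D \in (0, 1]$; 
        $x \leftarrow x + \alpha_P \Delta x$; 
        $(y, z, w) \leftarrow (y, z, w) + \alpha_D (\Delta y, \Delta z, \Delta w)$;        (5) 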



The main computational burden of the algorithm is the solution of the system 

$$(A \Theta A^T)\, \Delta y = r_b + A \Theta r = \bar b . \qquad (6)$$

Note that $A \Theta A^T$ is symmetric and positive definite, as $\Theta$ is clearly a positive 
definite diagonal matrix. Usually, interior-point codes solve (6) through a 
Cholesky factorization, preceded by a permutation of the columns of $A$ aimed at 
minimizing the fill-in effect. Several effective heuristics have been developed for 
computing such a permutation. Unfortunately, when $A$ is the constraints matrix 
of (1), the Cholesky factors of $A \Theta A^T$ turn out to be rather dense anyway [5]. 

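However, the structure of $A$ can be used to solve (6) without computing 
the factorization of $A \Theta A^T$. Note that $\Theta$ is partitioned into the $k$ blocks $\Theta^i$, 
$i = 1 \ldots k$, one for each commodity, plus the block $\Theta^0$ corresponding to the 
slack variables $x^0$ of the mutual capacity constraints. Hence, 

$$
A \Theta A^T =
\begin{bmatrix} B & C \\ C^T & D \end{bmatrix},
\qquad D = \sum_{i=0}^{k} \Theta^i ,
\qquad (7)
$$

i.e., $B$ is the block diagonal matrix having the $m \times m$ matrices $B_i = E \Theta^i E^T$, 
$i = 1 \ldots k$, as diagonal elements, and 

$$C^T = [\, C_1^T \ \ldots \ C_k^T \,] = [\, \Theta^1 E^T \ \ldots \ \Theta^k E^T \,] .$$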

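Exploiting (7), and partitioning the vectors $\Delta y$ and $\bar b$ (the right-hand side 
of (6)) accordingly, the solution of (6) is reduced to 

$$\Big( D - \sum_{i=1}^{k} C_i^T B_i^{-1} C_i \Big) \Delta y^0 = \bar b^0 - \sum_{i=1}^{k} C_i^T B_i^{-1} \bar b^i = \beta^0 , \qquad (8)$$

$$B_i \, \Delta y^i = ( \bar b^i - C_i \Delta y^0 ) = \beta^i , \quad i = 1 \ldots k . \qquad (9)$$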

The matrix 

$$H = D - C^T B^{-1} C = D - \sum_{i=1}^{k} C_i^T B_i^{-1} C_i \qquad (10)$$

is known as the Schur complement. 

Thus, (6) can be solved by means of (8), involving the Schur complement H, 
followed by the k subsystems (9) involving the matrices Bi. The latter step can 
be easily parallelized. However, solving (8) with a direct method, as advocated in 
[17,10], requires forming and factorizing H. As shown in [5], this matrix typically 
becomes rather dense, hence such a direct approach may become computation- 
ally too expensive. Furthermore, it represents a formidable serial bottleneck for 
a parallel implementation of the code. As suggested in [17], this bottleneck can 
be reduced by using parallel linear algebra routines [2,8,18]. However, it is also 
possible to avoid forming H at all, solving (8) by means of an iterative algorithm. 
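
The two-stage solve (first the Schur complement system for the mutual capacity block, then one independent system per commodity) can be illustrated with a deliberately tiny case. The sketch assumes m = 1 and n = 1, so that every block B_i, C_i and D is a scalar; the function name and argument layout are hypothetical, and here H is formed explicitly only because it is a single number.

```python
def solve_by_schur(B, C, D, b_blocks, b0):
    """Solve [[diag(B), C], [C^T, D]] [dy; dy0] = [b_blocks; b0], scalar blocks.

    B, C and b_blocks are lists with one entry per commodity; D and b0 are
    scalars. Returns (dy, dy0).
    """
    # Schur complement H = D - sum_i C_i * B_i^{-1} * C_i.
    H = D - sum(c * c / bi for bi, c in zip(B, C))
    # Reduced right-hand side beta0 = b0 - sum_i C_i * B_i^{-1} * b_i.
    beta0 = b0 - sum(c * rhs / bi for bi, c, rhs in zip(B, C, b_blocks))
    dy0 = beta0 / H
    # Once dy0 is known, the k commodity systems are mutually independent.
    dy = [(rhs - c * dy0) / bi for bi, c, rhs in zip(B, C, b_blocks)]
    return dy, dy0
```

The list comprehension in the last step is the part that parallelizes trivially by commodity; the scalar division by H stands in for the step that becomes the serial bottleneck when H is a large dense matrix.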

Since H is symmetric and positive definite, a preconditioned conjugate gra- 
dient (PCG) method can be used. In [5], a family of preconditioners is proposed, 
based on the following characterization of the inverse of H: 




where 

A preconditioner for (8) can be obtained by truncating the above power series at 
the $h$-th term. Clearly, the higher $h$, the better the preconditioning will be, and 
the fewer PCG iterations will be required. However, preconditioning one vector 
requires solving $k \times h$ linear systems involving the matrices $B_i$, thereby increasing 
the cost of each PCG iteration. The best trade-off between the reduction of the 
iteration count and the cost of each iteration is $h = 0$, corresponding to the 
diagonal preconditioner $D^{-1}$ [5]. 
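
As a sketch of why a truncated series is a natural preconditioner, write $Q = \sum_{i=1}^{k} C_i^T B_i^{-1} C_i$, so that $H = D - Q = D(I - D^{-1}Q)$. Since $H$ and $D$ are positive definite and $Q$ is positive semidefinite, the spectral radius of $D^{-1}Q$ is below one and the Neumann series converges. The symbols $Q$ and $M_h^{-1}$ are introduced here only for illustration; this is a form consistent with the text, not necessarily the exact statement given in [5].

```latex
\begin{align*}
H^{-1} &= (I - D^{-1}Q)^{-1} D^{-1}
        = \Big( \sum_{j=0}^{\infty} (D^{-1}Q)^j \Big) D^{-1}, \\
M_h^{-1} &= \Big( \sum_{j=0}^{h} (D^{-1}Q)^j \Big) D^{-1},
\qquad M_0^{-1} = D^{-1}.
\end{align*}
```

Each additional power of $D^{-1}Q$ costs one application of $Q$, that is, one solve with every $B_i$, which is the source of the per-iteration cost mentioned above.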

The IPM code, implementing this algorithm, has been shown to be competitive 
with a number of other sequential approaches [5]. It is written mainly in C, 
with only the Cholesky factorization routines (devised by E. Ng and B. Peyton 
[21]) coded in Fortran. Both the sequential and parallel versions can be freely 
obtained for academic purposes from 

http://www-eio.upc.es/~jcastro/software.html. 



4 Parallelization of the Algorithm 

The solution of (6) is by far the most expensive procedure in the interior-point 
algorithm, consuming up to 97% of the total execution time for large problems. 
With the above approach, this can be accomplished by means of the following 
steps: 

— Factorization of the $k$ matrices $B_i$; note that the current implementation 
uses sequential Cholesky solvers, but parallel Cholesky solvers could be used 
for increasing the degree of parallelism of the approach. 

— Computation of $\beta^0 = \bar b^0 - \sum_{i=1}^{k} C_i^T B_i^{-1} \bar b^i$, which requires $k$ backsolves on 
the factorizations of $B_i$ and matrix-vector products of the form $C_i^T v^i$. 

— For each iteration of the PCG, computation of $(D - \sum_{i=1}^{k} C_i^T B_i^{-1} C_i) v$, which 
requires backsolves on the factorizations of $B_i$ and matrix-vector products 
of the form $C_i v^0$ and $C_i^T v^i$. 

— Computation of $\beta^i = \bar b^i - C_i \Delta y^0$, which requires matrix-vector products of 
the form $C_i v^0$. 

— Solution of the systems $B_i \Delta y^i = \beta^i$. 

Hence, most of the parallelization effort boils down to performing in parallel 
the factorization of the $B_i$’s, backward and forward substitution with these 
factorizations and matrix-vector products involving $C_i$ or $C_i^T$. Thus, there is no 
need for sophisticated implementations of parallel linear algebra routines. Note 
that higher-order preconditioners ($h > 0$) would somewhat complicate the above 
scheme, but the basic blocks would remain the same. 
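
The PCG step can be sketched without ever forming the Schur complement. The code below is an illustration under simplifying assumptions, not the IPM code: it takes m = 1, so each B_i is a scalar and each C_i a single row, uses the diagonal of D as preconditioner, and runs serially; the names `pcg` and `make_schur_matvec` are hypothetical.

```python
def make_schur_matvec(d, c_rows, b_scalars):
    """Return v -> (D - sum_i C_i^T B_i^{-1} C_i) v for diagonal D, m = 1."""
    def matvec(v):
        out = [dj * vj for dj, vj in zip(d, v)]
        for c, b in zip(c_rows, b_scalars):  # one independent term per commodity
            coeff = sum(cj * vj for cj, vj in zip(c, v)) / b  # B_i^{-1} C_i v
            for j, cj in enumerate(c):
                out[j] -= coeff * cj                          # minus C_i^T (...)
        return out
    return matvec


def pcg(matvec, rhs, diag, tol=1e-10, max_iter=200):
    """Conjugate gradient with a diagonal preconditioner (positive `diag`)."""
    x = [0.0] * len(rhs)
    r = list(rhs)                                  # residual for x = 0
    z = [ri / di for ri, di in zip(r, diag)]       # preconditioned residual
    p = list(z)
    rz = sum(ri * zi for ri, zi in zip(r, z))
    for _ in range(max_iter):
        if max(abs(ri) for ri in r) <= tol:
            break
        hp = matvec(p)
        alpha = rz / sum(pi * hi for pi, hi in zip(p, hp))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * hi for ri, hi in zip(r, hp)]
        z = [ri / di for ri, di in zip(r, diag)]
        rz_new = sum(ri * zi for ri, zi in zip(r, z))
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x
```

In a parallel version, the loop over commodities inside `matvec` is the part distributed among processors: each one computes its own terms, and the partial vectors are then summed.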

Although the above procedures are by far the most important, a number 
of other minor steps can be easily parallelized, such as the computation of the 
other primal and dual directions ($\Delta x^i$, $\Delta z^i$, $\Delta w^i$), the computation of the primal 
and dual steplengths $\alpha_P$ and $\alpha_D$, the updating of the current primal and dual 
solution, the computation of the primal and dual objective function values and so 
on. It is easy to see that all the data concerning one given commodity $i$ ($x^i$, $c^i$, $u^i$, 
$y^i$, $w^i$, ...) can be stored in the local memory of the one processor that is in charge 
of that commodity, and it is never required by other processors. This ensures a 
good “locality” of data, and a low need for inter-processor communication. It 
should also be noted that the number of operations required for each commodity 
is the same, which guarantees the load balancing between processors, at least as 
long as the number of commodities assigned to each processor is the same. 
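
A simple way to picture the assignment is a split of the commodity indices into blocks of near-equal size. The contiguous-block policy and the function name below are assumptions made for illustration; the text only requires that each processor receive the same number of commodities for the load to be balanced.

```python
def assign_commodities(k: int, p: int) -> list[range]:
    """Split commodities 0..k-1 into p contiguous blocks of near-equal size."""
    if p < 1 or k < 0:
        raise ValueError("need p >= 1 and k >= 0")
    base, extra = divmod(k, p)
    blocks, start = [], 0
    for proc in range(p):
        # The first `extra` processors take one additional commodity.
        size = base + (1 if proc < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks
```

When p divides k, all blocks have the same length and the work per processor is identical; otherwise some processors carry one extra commodity and the others wait for them at each synchronization point.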

4.1 Parallel Programming Environment 

The parallel version of the IPM code, pIPM, has been developed on the Silicon 
Graphics Origin2000 (SGI O2000) server located at the European Center 
for Parallelism of Barcelona (CEPBA), running an IRIX64 6.5 Unix operating 
system. The SGI O2000 offers both message-passing and shared-memory 
programming paradigms, although the main memory is physically distributed 
among the processors. The server has 64 MIPS R10000 processors running at 
250 MHz, each of them with 32+32 KB L1 cache and 4 MB L2 cache and credited 
with 14.7 SPECint95 and 24.5 SPECfp95. A total of 8 GB of memory is distributed 
among these processing elements. This computer appeared at position 275 of the 
TOP500 November 1998 supercomputer sites list [11]. 

The default programming style supported by the SGI O2000 is a custom 
shared-memory version of C [25], with parallel constructs specified by means 
of compiler directives (#pragmas). Both OpenMP and SGI-specific pragmas are 
supported by the SGI O2000, but we mainly used the SGI-specific ones for the 
current version of pIPM. Placement of the memory on the processors and 
communication are hidden from the programmer and automatically performed by the 
system. The main advantage of this choice is ease of portability: existing codes 
can be parallelized with a limited effort. It is even possible to avoid maintain- 
ing two different versions (sequential and parallel) of the same code, which is 
important to optimize the development efforts. 

However, this programming style also has a number of drawbacks, mainly a 
limited control over memory ownership and limited support for vector-broadcast 
and vector-reduce operations. Placement of the data structures in the local 
memory of the processors can be only partly (and indirectly) influenced by the 
programmer. Also, the granularity of memory placement is that of the virtual 
memory pages (16K) rather than that of the individual data structures. All this can 
result in cache misses and page faults from the local memory of each processor, 
decreasing the performance of the parallel codes. Although advanced directives 
allow a more detailed control over these features, the use of those directives 
requires a more extensive rewriting of the code, thus losing part of the benefits in 
terms of portability and ease of maintenance. Because of that, the computational 
results presented in Section 5 were obtained with the default data distribution 
provided by the system (the same used in [2]). However, the assignment of com- 
modities to processors was optimized for this distribution, hopefully limiting the 
possible negative effects. The limited support for broadcast/reduce operations 
is understandable in a shared-memory oriented language; however, it may result 
in poorer performance for codes, like pIPM, where these operations account for 
almost all of the communication time. 

5 Computational Results 

5.1 The Instances 

Three sets of multicommodity instances were used for the computational exper- 
iments. The first is made up of 18 problems obtained with an improved version 
of Ali and Kennington’s Mnetgen generator [12]. These instances are very large 
(up to about 2.5 million variables and 260,000 constraints), with a number of 
commodities which varies from very few (8) to quite many (512). This is useful 
for characterizing the trends in the performances of the code as the number of 
commodities varies [7,12]. 

The second set consists of ten of the PDS (Patient Distribution System) 
problems. These problems arise from a logistic model for evacuating patients 
from a place of military conflict. The different instances arise from the same 
basic scenario by varying the time horizon, i.e., the number of days covered by 
the model. The PDS problems have been considered, until recently, essentially 
impossible to solve with a high degree of accuracy. Although this has changed, 
they are still quite challenging multicommodity instances. 

The third set of problems is made of the four Tripart problems and of the 
Gridgen1 problem. These instances were obtained respectively with the Tripartite 
generator and with a multicommodity version of the well-known Gridgen 
single-commodity flow generator [4]. These are very difficult multicommodity 
flow instances, as shown in Section 5.3. 

The dimensions of each problem are reported in Tables 1, 2 and 3. Columns 
“$m$”, “$n$”, and “$k$” show the number of nodes, arcs, and commodities. Columns 
“$\bar n$” and “$\bar m$” give the number of variables and constraints of the linear problem. 
All the instances can be downloaded from 

http://www.di.unipi.it/di/groups/optimize/Data. 

5.2 Performance Measures 

The following well-known performance measures [3] will be considered for 
assessing the performance of pIPM. Denoting by $T_p$ the execution time obtained with 
$p$ processors, the speedup $S_p$ with $p$ processors can be defined as $S_p = T_1/T_p$. 
The fraction of the sequential execution time consumed in the parallel region of 
the code will be denoted by $f$; values of $f$ close to 1 are necessary in order to 
obtain good speedups, as demonstrated by Amdahl’s law 

$$S_p \le \bar S_p = \frac{1}{f/p + (1 - f)} \le \frac{1}{1 - f} .$$

The efficiency with $p$ processors is 

$$E_p = \frac{S_p}{p} \le \frac{\bar S_p}{p} = \bar E_p .$$

$E_p$ represents the fraction of the time that a particular processor (of the $p$ 
available) is usefully employed during the execution of the algorithm. $\bar S_p$ and $\bar E_p$ 
are respectively the ideal speedup and efficiency, the maximum ones that can be 
obtained due to the inherent serial bottlenecks in the algorithm. 
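
The ideal values follow directly from Amdahl's law and can be computed from the parallel fraction and the processor count alone. The function names below are illustrative.

```python
def ideal_speedup(f: float, p: int) -> float:
    """Amdahl bound on speedup for parallel fraction f on p processors."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must lie in [0, 1]")
    if p < 1:
        raise ValueError("p must be at least 1")
    return 1.0 / (f / p + (1.0 - f))


def ideal_efficiency(f: float, p: int) -> float:
    """Ideal efficiency: the Amdahl speedup bound divided by p."""
    return ideal_speedup(f, p) / p
```

For a fixed f below 1, the ideal speedup never exceeds 1/(1 - f) however many processors are used, so the ideal efficiency necessarily falls as p grows.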

Another interesting performance measure is the absolute speedup, obtained 
by replacing Ti with the execution time of the best serial algorithm known. This 
is usually difficult to obtain, and it will be discussed separately. 



5.3 The Results 

Tables 1, 2 and 3 show the computational results obtained. Columns “IP” and 
“PCG” report the total number of interior-point and PCG iterations, 
respectively. Column “$f$” gives the fraction of the total sequential time consumed in 
the parallel region of the code. Column “$p$” gives the number of processors used 
in the execution. “$T_p$” denotes the execution (wall-clock) time, excluding 
initializations. Columns “$S_p$” and “$E_p$” give respectively the observed speedups and 
efficiencies, while columns “$\bar S_p$” and “$\bar E_p$” report their ideal values. 

Analyzing the results, the following trends emerge: 

— $f$ is always fairly large, and increases with the problem size; the largest 
problems attain very high ideal efficiencies. This indicates that the approach 
has a good potential for scalability, at least in theory, for very large scale 
problems. 

— For fixed $p$ and $k$, $E_p$ almost always increases with the size of the underlying 
network, in all three groups of instances. This is reasonable: the 
computational burden of the PCG iteration grows quadratically with the number 
of nodes, while the communication cost grows only linearly. This seems to 
indicate that the approach is especially suited for problems where the size 
of the network is large w.r.t. the number of commodities. Remarkably, IPM 
has been shown to be particularly efficient, at least w.r.t. decomposition 
approaches, exactly for this kind of instance [12]. 

— Keeping $p$ and the size of the network fixed, $E_p$ initially increases with $k$; 
however, for “large” values of $k$, $E_p$ stalls and may even decrease. This 
phenomenon, clearly visible in the Mnetgen results, is difficult to explain. For 
fixed $p$, increasing $k$ can, in theory, only increase the fraction of time that is 
spent in the parallel part of the algorithm, while the sequential bottleneck 
and the communication requirements should remain the same. Indeed, the ideal 
efficiency $\bar E_p$ is monotonically nondecreasing with $k$. This decrease in efficiency 
is most likely an effect of the page-based memory placement, which may cause data 
logically pertaining to one processor to be physically located on another. 

— For any fixed instance, the observed efficiency obviously decreases as p
increases; unfortunately, the decrease is much faster than that predicted by
the ideal efficiency, so that the gap between the two increases with p.
However, for fixed p the gap decreases when the size of the network increases,
and a similar (although less clear) trend seems to exist w.r.t. k. Thus,
whatever mechanism is responsible for this discrepancy between the observed
and the ideal efficiency, its effects seem to lessen as the instances grow
larger.
Table 1. Dimensions and results for the Mnetgen problems.
Table 2. Dimensions and results for the PDS problems.
Table 3. Dimensions and results for the Tripart and Gridgen problems.
Since, except for PDS problems with p = 6, each processor is assigned the
same number of commodities, there can be no load imbalance between the
processors. Thus, the gap between the observed and the ideal efficiency can
only be explained as being due to communication time. Indeed, pIPM requires
more communication than most other parallel codes for multicommodity flows.
Most of the communication occurs at each PCG iteration, during the computation
of $\left(\sum_i C_i^T B_i^{-1} C_i\right) v$, where v is the current estimate
of the solution of (8). This requires first the broadcast of v from the
“master” processor (the one executing the serial-only part of the code) to all
the other processors, followed by a vector-reduce operation to accumulate all
the partial results $C_i^T B_i^{-1} C_i v$ back to the “master” processor.
The amount of communication is essentially the same as in the decomposition
approaches [7,13,22], and substantially lower than that of the other
specialized parallel interior-point codes [17,10], which need to share the
(dense) matrices $C_i^T B_i^{-1} C_i$ in order to form the Schur complement H.
However, in pIPM communication occurs at every PCG iteration, i.e., much more
often than in decomposition codes. The other specialized parallel
interior-point codes have a much smaller number of communication “rounds”,
one for each interior-point iteration, although each round is more expensive.

Thus, pIPM may be inherently more vulnerable to slowdowns induced by 
communication costs. Indeed, the efficiency of pIPM seems to be, on average, 
somewhat worse than that of the approach in [17], even though direct comparison
is difficult due to the different sets of test problems. The instances used in [17] 
are much smaller, and the cost of forming and factorizing H grows rapidly with 
the size of the problem. 

Furthermore, the current implementation of pIPM, using the parallel constructs
available in the SGI O2000 C compiler [25], is not aggressively optimized,
particularly in the two critical operations, i.e., broadcasts and vector-reduces.
Both are currently obtained by means of read/write operations to shared vec- 
tors, which are presumably less efficient than the typical system-provided imple- 
mentation which exploits information about the topology of the interconnection 
network and the available communication hardware. Also, a part of the commu- 
nication overhead could be due to a non-optimal placement of the data structures 
in the local memory of the processors, especially at the boundaries of the virtual 
memory pages. Thus, we believe that there is still room for (potentially large) re- 
ductions of the gap between the observed and the theoretical speedup/efficiency 
of the code. However, pIPM already attains quite satisfactory efficiencies in some 
instances, most notably the largest PDS problems. 

As far as the absolute speedup is concerned, IPM is known not to be the
fastest sequential code for some of the test instances. In [12], a bundle-based
decomposition approach has been shown to outperform IPM on the Mnetgen
instances, while IPM was competitive on the PDS problems. Furthermore, recent
developments in the field of simplex methods [19] have led to impressive
performance improvements for these algorithms on multicommodity flow problems.
Nowadays, even the largest PDS problems can be solved in less than an hour of
CPU with the state-of-the-art simplex code Cplex 6.5 [9]. However, the simplex
method is not easily parallelized. Furthermore, other multicommodity problems,
like the Tripart and the Gridgen, are much more difficult to solve;
ε-approximation algorithms can approximately solve them in a relatively short
time [4], but only if the required accuracy is not high. On these instances,
the interior-point algorithm in Cplex 6.5 is far more efficient than the dual
simplex, but it is in turn largely outperformed by IPM, as shown in Table 4.
Columns “IPM” and “Cplex 6.5” show the running time required for the solution
of the problem by IPM and Cplex 6.5, respectively, on a Sun Ultra2 2200/200
workstation (rated at 7.8 SPECint95 and 14.7 SPECfp95) with 1 GB of main
memory. The last column shows the estimated ratio between the running time of
Cplex 6.5 and that of pIPM run on the largest possible number of processors,
showing that, at least for the largest and most difficult instances of the
set, pIPM provides a competitive approach.

Table 4. Comparing Cplex 6.5, IPM, and pIPM on the Tripart and Gridgen
problems.

Problem      IPM       Cplex 6.5    Cplex 6.5 / pIPM*
Tripart1     40        74           3.3
Tripart2     249       627          6.5
Tripart3     1584      2851         6.7
Tripart4     4983      33235        36.3
Gridgen1     126008    > 2.8e+6     253.1

* Considering the maximum number of processors for pIPM



6 Conclusions and Future Research 

The parallel code pIPM presented in this work can be an efficient tool for the 
solution of certain types of large and difficult multicommodity problems. Quite 
good speedups are achieved in some instances, such as the large PDS problems. In 
other cases, a gap between the ideal efficiency and the observed one exists.
However, we are confident that a more efficient implementation of
reduce/broadcast operations and a better placement of data structures (which
could mean using MPI or PVM as parallel environments) can make pIPM even more
competitive on a wider range of multicommodity instances.
Abstract. In this paper we present a parallel algorithm that solves the
Toeplitz Least Squares Problem. We exploit the displacement structure 
of Toeplitz matrices and parallelize the Generalized Schur method. The 
stability problems of the method are solved by using a correction pro- 
cess based on the Corrected Semi-Normal Equations [10]. Other problems 
arising in the parallelization of the method, such as the data dependencies 
and high communication cost, have been addressed with an optimized 
distribution of the data, rearrangement of the computations, and with 
the design of new basic parallel subroutines. We have used standard tools 
like the ScaLAPACK library based on the MPI environment. Experimen- 
tal results have been obtained in a cluster of personal computers with a 
high performance interconnection network. 

1 Introduction 

Our goal is to obtain an efficient parallel solution of the Least Squares (LS)
problem

$$\min_x \|Tx - b\|_2 , \qquad (1)$$

where $T \in \mathbb{R}^{m \times n}$ is a Toeplitz matrix,
$T_{ij} = t_{i-j} \in \mathbb{R}$ for $i = 0, \ldots, m-1$ and
$j = 0, \ldots, n-1$, and $b \in \mathbb{R}^m$ is an arbitrary vector.
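
As a small illustration of the notation in (1), the following Python sketch builds a Toeplitz matrix from its defining sequence $t_{i-j}$ and evaluates the residual norm $\|Tx - b\|_2$ for a trial vector. The helper names (`toeplitz_from_sequence`, `residual_norm`) and the sample data are assumptions made only for this example; they do not come from the implementation described in this paper, which was written in FORTRAN.

```python
import math

def toeplitz_from_sequence(t, m, n):
    """Return the m x n Toeplitz matrix with entries T[i][j] = t[i - j].

    `t` maps every offset from -(n-1) to m-1 to a real value.
    """
    return [[t[i - j] for j in range(n)] for i in range(m)]

def residual_norm(T, x, b):
    """Euclidean norm of T x - b, the quantity minimized in (1)."""
    r = [sum(tij * xj for tij, xj in zip(row, x)) - bi
         for row, bi in zip(T, b)]
    return math.sqrt(sum(ri * ri for ri in r))

# Hypothetical data: m = 4, n = 3, so the offsets run from -2 to 3.
t = {k: float(k + 3) for k in range(-2, 4)}
T = toeplitz_from_sequence(t, 4, 3)
b = [1.0, 2.0, 3.0, 4.0]

# Entries are constant along each diagonal.
print(T[1][0] == T[2][1] == T[3][2])
# With x = 0 the residual is simply the norm of b.
print(residual_norm(T, [0.0, 0.0, 0.0], b))
```

Only m + n - 1 numbers define the whole matrix, which is the structure that the fast algorithms discussed below exploit.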

This problem arises in many applications such as time series analysis, image 
processing, control theory, statistics, in some cases with real-time constraints 
(e.g. in radar and sonar applications). 

It is well known that the LS Problem can be solved in $O(mn^2)$ flops in
general, using e.g. Householder transformations [8]. However, in the case of
Toeplitz matrices, several fast algorithms with a cost of $O(mn)$ flops have
been developed [13, 7]. All these fast algorithms are generalizations of a
classical algorithm by Schur [11]. Nevertheless, these algorithms are less
stable than the algorithm included in LAPACK [3] based on Householder
transformations. A good overview of the stability and accuracy of fast
algorithms for structured matrices can be found in [5].

* Partially funded by the Spanish Government through the project CICYT
TIC96-1062-C03.

J.M.L.M. Palma et al. (Eds.): VECPAR2000, LNCS 1981, pp. 316-329, 2001.

In this paper, we exploit several enhancements to improve the accuracy of the 
fast algorithm based on the Generalized Schur Algorithm. The parallel algorithm 
that we have implemented is based on the sequential method proposed by H. 
Park and L. Elden [10]. In that paper the accuracy of the R factor computed by 
the Generalized Schur Algorithm is improved by post-processing the R factor 
using Corrected Semi-Normal Equations (CSNE). Our parallel algorithm inherits 
the accuracy properties of that method. 

Assume that the matrix $R_0 \in \mathbb{R}^{(n+1)\times(n+1)}$ is the upper
triangular submatrix in the R-factor of the QR decomposition of the matrix
$[b\ T]$,

$$R_0 = \mathrm{qr}([b\ T]) = \begin{pmatrix} \kappa & \omega^T \\ 0 & G \end{pmatrix} , \quad \kappa \in \mathbb{R},\ \omega \in \mathbb{R}^{n},\ G \in \mathbb{R}^{n \times n} , \qquad (2)$$

where G is upper triangular and where the operator qr denotes the upper square
submatrix of the R factor of the QR decomposition of a given matrix. Then, the
LS problem (1) can be solved via a product of Givens rotations J that reduces
a Hessenberg matrix to upper triangular form (3), where
$R \in \mathbb{R}^{n \times n}$ is the upper triangular factor of the QR
decomposition of T.

The vector solution x in (1) is obtained by solving the triangular linear system 
Rx = Ti [8]. 
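
The reduction of a Hessenberg matrix to upper triangular form by Givens rotations, mentioned above, can be sketched as follows. This is a generic sequential illustration in Python, not the parallel routine of the paper; the function name `givens_triangularize` and the test matrix are assumptions made for the example.

```python
import math

def givens_triangularize(H):
    """Reduce an upper Hessenberg matrix (list of rows) to upper triangular form.

    One Givens rotation per column zeroes the single subdiagonal entry.
    The input is left unchanged; a new matrix is returned.
    """
    A = [row[:] for row in H]
    rows, cols = len(A), len(A[0])
    for j in range(min(cols, rows - 1)):
        a, b = A[j][j], A[j + 1][j]
        r = math.hypot(a, b)
        if r == 0.0:
            continue                      # nothing to eliminate in this column
        c, s = a / r, b / r
        for k in range(j, cols):          # rotate rows j and j + 1
            u, v = A[j][k], A[j + 1][k]
            A[j][k] = c * u + s * v
            A[j + 1][k] = -s * u + c * v
    return A

# Hypothetical 4 x 3 upper Hessenberg matrix.
H = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0],
     [0.0, 7.0, 8.0],
     [0.0, 0.0, 9.0]]
R = givens_triangularize(H)
```

Each rotation touches only two adjacent rows, so the cost is linear in the number of columns per rotation, and the rotations preserve the Euclidean norm of every column.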

Solving the LS problem (1) involves four main steps that we summarize in 
Algorithm 3 in section 5. First we form a generator pair (which we will explain 
further on). Starting from the generator pair, we obtain the triangular factor 
G which appears in (2) by means of the Generalized Schur Algorithm. The 
third step consists of refining that factor G in order to improve its accuracy. 
The last step is an ordinary solution of a triangular system, as done in the
standard method for solving a general LS problem via a QR decomposition of
the augmented matrix [T b].

We have parallelized the four steps. The second step is described in section 2 
while the third one is described in sections 3 and 4. With the parallel version of 
the Generalized Schur Algorithm proposed, we can reduce the time needed to 
obtain the triangular factor G and we obtain a parallel kernel available to other 
problems based on this method (e.g. linear structured systems, see [1, 2]). With
the parallel version of the third step we can reduce the overhead introduced by
the refinement step.

Several problems arise in the development of the parallel version of the 
method. First, the low computational cost of the algorithm makes it difficult 
to obtain high parallel performances. Second, the sequential algorithm has a lot 
of data dependencies as in many other fast algorithms for structured matrices. 
This fact reduces drastically the granularity of the basic computational steps, in- 
creasing the cost of the communications. However, we have reduced the effect of 
these problems by using an appropriate data distribution, reducing the number
of messages by rearranging the operations, and implementing new basic parallel 
subroutines adapted to the chosen data distribution. 

We have implemented all algorithms in FORTRAN, using the BLAS and LAPACK
libraries for the sequential subroutines, and the PBLAS and ScaLAPACK [4]
libraries for solving basic linear algebra problems in parallel and for data
distribution. We have used subroutines of the BLACS package over MPI [12] to
perform the communications. The use of standard libraries ensures the
portability of the algorithms and produces a parallel program based on
well-known, efficient and tested public code. As shown below, the best topology
to run the parallel algorithm is a logical grid of p × 1 processors.

2 Exploiting the Displacement Structure 

The displacement of the matrix $[b\ T]^T[b\ T]$ with respect to the shift
matrix $Z = [z_{ij}]$, where $z_{ij} = 1$ if $i = j + 1$ and $z_{ij} = 0$
otherwise, is denoted by $\nabla_Z$ and defined as

$$\nabla_Z = [b\ T]^T[b\ T] - Z[b\ T]^T[b\ T]Z^T = \mathcal{G} J \mathcal{G}^T . \qquad (4)$$

The matrix $[b\ T]^T[b\ T]$ has low displacement rank with respect to Z if the
rank of $\nabla_Z$ is considerably lower than n [9]. The factor $\mathcal{G}$
is an n × 6 matrix called generator, and J is the signature matrix
$(I_3 \oplus -I_3)$. The pair $(\mathcal{G}, J)$ is called a generator pair.
Given a general Toeplitz matrix T and an arbitrary vector b, the generator
pair for equation (4) is known [10].
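
The low displacement rank can be checked numerically on a small example. The sketch below, in pure Python with exact rational arithmetic, forms $[b\ T]^T[b\ T]$ for random integer data, subtracts its shifted copy as in (4), and computes the rank of the result. The function names and the random test data are assumptions made for illustration; the sketch does not construct the generator itself.

```python
from fractions import Fraction
import random

def gram(A):
    """Return A^T A for a matrix given as a list of rows."""
    cols = list(zip(*A))
    return [[sum(x * y for x, y in zip(ci, cj)) for cj in cols] for ci in cols]

def displacement(M):
    """Return M - Z M Z^T, where Z is the lower shift matrix.

    Z M Z^T is M moved one position down and to the right, with a zero
    first row and first column.
    """
    n = len(M)
    return [[M[i][j] - (M[i - 1][j - 1] if i > 0 and j > 0 else 0)
             for j in range(n)] for i in range(n)]

def rank(M):
    """Exact rank by Gaussian elimination over the rationals."""
    A = [[Fraction(x) for x in row] for row in M]
    r = 0
    for c in range(len(A[0])):
        pivot = next((i for i in range(r, len(A)) if A[i][c] != 0), None)
        if pivot is None:
            continue
        A[r], A[pivot] = A[pivot], A[r]
        for i in range(r + 1, len(A)):
            f = A[i][c] / A[r][c]
            A[i] = [x - f * y for x, y in zip(A[i], A[r])]
        r += 1
    return r

random.seed(0)
m, n = 12, 9
t = {k: random.randint(-5, 5) for k in range(-(n - 1), m)}
b = [random.randint(-5, 5) for _ in range(m)]
bT = [[b[i]] + [t[i - j] for j in range(n)] for i in range(m)]  # the matrix [b T]
d_rank = rank(displacement(gram(bT)))
print(d_rank)  # never exceeds 6, although the matrix is 10 x 10
```

The bound of 6 matches the six columns of the generator: two come from the Toeplitz part, and the remaining four from the border rows and columns introduced by b and by the first column of T.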

The Generalized Schur Algorithm computes the $R_0$ factor in (2). This is
equivalent to performing the following Cholesky decomposition,

$$[b\ T]^T[b\ T] = R_0^T R_0 .$$

The cost of this algorithm is $O(mn)$ flops instead of the $O(mn^2)$ flops
required by the standard LAPACK algorithm. The Generalized Schur Algorithm is
a recursive process of n steps. In the i-th step, a J-unitary transformation
$\Theta_i$ ($\Theta_i J \Theta_i^T = J$) is computed in order to transform the
first nonzero row of $\mathcal{G}$ to a vector of the form (x 0 0 0 0 0). The
first column of the generator $\mathcal{G}$ is then the i-th row of $R_0$.
Each J-unitary transformation is performed by a composition of two Householder
transformations and a hyperbolic rotation [6].
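
The hyperbolic rotation that completes each J-unitary transformation can be illustrated on a pair of scalars. The Python sketch below is a textbook 2 × 2 hyperbolic rotation, given here under the assumption that the pair corresponds to one entry with signature +1 and one with signature -1; the function names are hypothetical and the code is not taken from the paper.

```python
import math

def hyperbolic_rotation(a, b):
    """Return (c, s) with c*c - s*s = 1 such that the rotation
    [[c, -s], [-s, c]] maps (a, b) to (sign(a) * sqrt(a*a - b*b), 0).

    Requires |a| > |b|; otherwise no real hyperbolic rotation exists.
    """
    if abs(a) <= abs(b):
        raise ValueError("need |a| > |b|")
    rho = b / a
    c = 1.0 / math.sqrt(1.0 - rho * rho)
    return c, rho * c

def apply_rotation(c, s, x, y):
    """Apply [[c, -s], [-s, c]] to the pair (x, y)."""
    return c * x - s * y, -s * x + c * y

a, b = 5.0, 3.0
c, s = hyperbolic_rotation(a, b)
x, y = apply_rotation(c, s, a, b)
```

Unlike a Givens rotation, which preserves $x^2 + y^2$, this transformation preserves $x^2 - y^2$, which is what the signature matrix J requires. It becomes ill-conditioned when |b| approaches |a|, one source of the stability problems that motivate the refinement step.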

In the parallel version of the Generalized Schur Algorithm that we present,
the generator $\mathcal{G}$ is divided into blocks of v rows and cyclically
distributed over a p × 1 processor grid, as shown in Fig. 1. The processor
holding the i-th row of $\mathcal{G}$ (the first nonzero row of the generator,
denoted by the x entries) computes the J-unitary transformation and broadcasts
it to the rest. The rows from i to n of the generator are then updated in
parallel. Afterwards, the nonzero entries of the distributed first column of
$\mathcal{G}$ are copied on to the i-th column of the factor (entries $l_i$)
without communication cost. When the n steps have been executed, we get the
transposed upper triangular R factor of the QR decomposition of the matrix
[b T], distributed over all the processors.
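
The block row cyclic layout used above can be expressed with a one-line ownership rule. The sketch below uses 0-based row indices and takes the values of the example in Fig. 1 (18 rows, a 3 × 1 grid) with a block size of 4 rows, which is an assumption read off the figure; the function names are hypothetical.

```python
def owner(row, block_size, num_procs):
    """Processor that owns a global row in a block row cyclic layout."""
    return (row // block_size) % num_procs

def local_rows(proc, n_rows, block_size, num_procs):
    """Global row indices stored by processor `proc`, in local order."""
    return [r for r in range(n_rows)
            if owner(r, block_size, num_procs) == proc]

# Assumed example layout: 18 rows, blocks of 4 rows, 3 processors.
layout = [owner(r, 4, 3) for r in range(18)]
print(layout)
print(local_rows(0, 18, 4, 3))
```

In step i of the algorithm, `owner(i, v, p)` identifies the processor that computes and broadcasts the transformation, so the broadcasting role moves cyclically through the grid as the steps advance.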
Fig. 1. Block row cyclic distribution of an example generator G of size 18 × 6
(entries x) and an 18 × 18 lower triangular factor L (entries $l_i$) obtained
by the Generalized Schur Algorithm, with a block size of v = 4 rows over a
3 × 1 processor grid. The figure shows the distributed workspaces after three
steps of the algorithm.



Finally, in each step, the first column of the generator has to be shifted one
position down. This operation involves a critical communication cost with
respect to the small computational cost of each iteration. Each processor has
to send one element per block to the next processor, and it has to receive a
number of messages equal to the number of blocks of the previous processor.
This scheme implies point-to-point communication of several one-element
messages. In our parallel algorithm, each processor packs all the elements to
be sent into one message. The destination processor receives it, unpacks the
scalars and places them into the destination blocks. Fig. 2 shows this
parallel shift process. The total number of messages is reduced from
$O(n^2/v)$ to $O(pn)$ and, therefore, the global latency time is also reduced.
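
The effect of packing can be simulated sequentially. The following Python sketch shifts a block-cyclically distributed vector one position down and counts how many messages would be needed with one message per element and with one packed message per sending processor. It is a simplified model (the whole vector is shifted and no real communication takes place); the function names and the test vector are assumptions for the example.

```python
def owner(g, v, p):
    """Processor owning global index g with block size v on p processors."""
    return (g // v) % p

def packed_shift(x, v, p):
    """Shift x one position down; return (shifted, naive_msgs, packed_msgs)."""
    n = len(x)
    shifted = [0] * n
    outbox = {q: [] for q in range(p)}    # elements leaving each processor
    for g in range(n - 1):
        src, dst = owner(g, v, p), owner(g + 1, v, p)
        if src == dst:
            shifted[g + 1] = x[g]         # local copy, no communication
        else:
            outbox[src].append((g + 1, x[g]))
    packed_msgs = 0
    for src, items in outbox.items():
        if items:
            packed_msgs += 1              # one packed message per sender
            for g, value in items:        # receiver unpacks into its blocks
                shifted[g] = value
    naive_msgs = sum(len(items) for items in outbox.values())
    return shifted, naive_msgs, packed_msgs

x = list(range(1, 25))                    # 24 elements, as in Fig. 2
shifted, naive_msgs, packed_msgs = packed_shift(x, 3, 3)
print(naive_msgs, packed_msgs)
```

With 24 elements, blocks of 3 and 3 processors, seven elements cross a block boundary but only three packed messages are sent, one per processor. Repeated over the n steps of the algorithm, this gives the reduction in message count stated above.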



3 Improving the Accuracy of the Method 

When the R factor in (3) is ill-conditioned, we can expect a large error in the
matrix G computed by the Generalized Schur Algorithm. Therefore, we need to
apply the correction step proposed in [10] in order to refine the R factor
obtained by the Generalized Schur Algorithm described above.
Fig. 2. Shift of elements 3 to 20 of a vector of 24 elements distributed on a
3 × 1 processor grid. The block size is v = 3.



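We assume the following partition of the Toeplitz matrix T,

$$T = \begin{pmatrix} t_0 & f_r^T \\ f_c & T_0 \end{pmatrix} = \begin{pmatrix} T_0 & l_c \\ l_r^T & t_{m-n} \end{pmatrix} ,$$

where $T_0 \in \mathbb{R}^{(m-1)\times(n-1)}$ is a Toeplitz submatrix of T,
$f_r, l_r \in \mathbb{R}^{(n-1)\times 1}$ and
$f_c, l_c \in \mathbb{R}^{(m-1)\times 1}$.

We also assume a partition of the matrix composed of the factor G and the
vector ω in (2),

$$\begin{pmatrix} \omega^T \\ G \end{pmatrix} = \begin{pmatrix} \omega_1 & \omega_b^T \\ g_{11} & g_r^T \\ 0 & G_b \end{pmatrix} = \begin{pmatrix} \omega_t^T & \omega_n \\ G_t & g_c \\ 0 & g_{nn} \end{pmatrix} , \qquad (5)$$

where $G_b, G_t \in \mathbb{R}^{(n-1)\times(n-1)}$,
$\omega_b, \omega_t, g_r, g_c \in \mathbb{R}^{(n-1)\times 1}$ and
$g_{11}, g_{nn}, \omega_1, \omega_n \in \mathbb{R}$. We also define a matrix
$\tilde{G} \in \mathbb{R}^{(n-1)\times(n-1)}$ such that

$$\tilde{G}^T \tilde{G} = G_t^T G_t + \omega_t \omega_t^T + f_r f_r^T . \qquad (6)$$

It can be shown that the matrix $\tilde{G}$ is the upper triangular R factor of
the QR decomposition of the matrix X,

$$\tilde{G} = \mathrm{qr}(X) , \quad X = \begin{pmatrix} f_r^T \\ T_0 \\ l_r^T \end{pmatrix} \in \mathbb{R}^{(m+1)\times(n-1)} . \qquad (7)$$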
Basically, the refinement step proceeds as follows. First, we have to find an
orthogonal matrix W which transforms the first two columns of [b T] to upper
triangular form and, accordingly, the first two rows into

$$\begin{pmatrix} \kappa & \omega_1 & \omega_b^T \\ 0 & g_{11} & g_r^T \end{pmatrix} . \qquad (8)$$

We define a matrix $W_1$ as

$$W_1 = \begin{pmatrix} 0 & w_1 & w_2 \\ 1 & 0 & 0 \end{pmatrix} , \qquad (9)$$

where $w_1$ and $w_2$ denote the first and second columns of matrix W.

Once the factor $G_t$ in (5) has been obtained from the Generalized Schur
Algorithm, the matrix $\tilde{G}$ in (6) is computed by updating $G_t$ via
Givens rotations. Then, a refined matrix $G_b$ (5) is obtained by downdating
the block H from $\tilde{G}$,

$$G_b^T G_b = \tilde{G}^T \tilde{G} - H^T H ,$$

where $H^T = (l_r\ \omega_b\ g_r) \in \mathbb{R}^{(n-1)\times 3}$.

The downdating process is performed by solving the LS problem

$$\min_V \|W_1 - XV\|_F , \qquad (10)$$

with the Corrected Semi-Normal Equations method (CSNE).
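
The idea behind the Corrected Semi-Normal Equations can be sketched for a single right-hand side: given a triangular factor R of X, solve $R^T R v = X^T w$ and then apply one correction step driven by the residual. The Python code below is a generic textbook version written for illustration. It obtains R from a Cholesky factorization of $X^T X$, which is a stand-in for the factor that the paper obtains from the Generalized Schur Algorithm, and all function names and data are assumptions.

```python
import math

def cholesky_upper(A):
    """Upper triangular R with R^T R = A (A symmetric positive definite)."""
    n = len(A)
    R = [[0.0] * n for _ in range(n)]
    for i in range(n):
        d = A[i][i] - sum(R[k][i] ** 2 for k in range(i))
        R[i][i] = math.sqrt(d)
        for j in range(i + 1, n):
            R[i][j] = (A[i][j] - sum(R[k][i] * R[k][j] for k in range(i))) / R[i][i]
    return R

def solve_rt(R, y):
    """Solve R^T z = y by forward substitution."""
    z = []
    for i in range(len(y)):
        z.append((y[i] - sum(R[k][i] * z[k] for k in range(i))) / R[i][i])
    return z

def solve_r(R, z):
    """Solve R x = z by back substitution."""
    n = len(z)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(R[i][k] * x[k] for k in range(i + 1, n))) / R[i][i]
    return x

def matvec(X, v):
    """X v."""
    return [sum(a * c for a, c in zip(row, v)) for row in X]

def rmatvec(X, r):
    """X^T r."""
    return [sum(row[j] * ri for row, ri in zip(X, r)) for j in range(len(X[0]))]

def csne(X, R, w):
    """Semi-normal equations R^T R v = X^T w plus one correction step."""
    v = solve_r(R, solve_rt(R, rmatvec(X, w)))
    res = [wi - xi for wi, xi in zip(w, matvec(X, v))]
    dv = solve_r(R, solve_rt(R, rmatvec(X, res)))
    return [a + c for a, c in zip(v, dv)]

# Hypothetical data: fit w = 1 + 2 t at t = 0, 1, 2, 3.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
w = [1.0, 3.0, 5.0, 7.0]
gram = [[sum(r[i] * r[j] for r in X) for j in range(2)] for i in range(2)]
v = csne(X, cholesky_upper(gram), w)
```

The method needs only products with X and $X^T$ and triangular solves with R, never an orthogonal factor of X, which is why it pairs well with a fast algorithm that delivers R alone. In problem (10) the same steps are applied to the three columns of $W_1$ at once.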

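Once the LS problem (10) is solved, we obtain a matrix
$Q_1 \in \mathbb{R}^{(n-1)\times 3}$ and an upper triangular matrix
$\Gamma \in \mathbb{R}^{3\times 3}$, and we construct the following matrix

$$\begin{pmatrix} Q_1 & \tilde{G} \\ \Gamma & 0 \end{pmatrix} , \qquad (11)$$

that we have to triangularize by a product of Givens rotations M,

$$M \begin{pmatrix} Q_1 & \tilde{G} \\ \Gamma & 0 \end{pmatrix} = \begin{pmatrix} I_3 & H \\ 0 & G_b \end{pmatrix} , \qquad (12)$$

in order to obtain the refined factor $G_b$ and, therefore, a more accurate R
factor of the QR decomposition of the matrix T.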

All steps described above are summarized in the following algorithm. 

Algorithm 1 (CSNE Refinement Step).

Let $G_t$ denote the triangular factor (5) computed from the Generalized Schur
Algorithm, $W_1$ the matrix defined in (9), and X the matrix defined in (7).
The refinement step proceeds as follows:

1. Compute $\tilde{G}$ by triangularizing $(G_t^T\ \omega_t\ f_r)^T$.

2. Compute $Q_1$, V and F from

   $\tilde{G}^T Q_1 = X^T W_1$ , $\tilde{G} V = Q_1$ , $F := W_1 - XV$ .

3. (a) Update $Q_1$, V, and F:

   $\tilde{G}^T Q_1' = X^T F$ , $Q_1 := Q_1 + Q_1'$ , $\tilde{G} V' = Q_1'$ , $F := F - XV'$ .

   (b) Compute the upper triangular factor Γ of the QR decomposition of F.

4. Triangularize $\begin{pmatrix} Q_1 & \tilde{G} \\ \Gamma & 0 \end{pmatrix}$
   by a product of Givens rotations M:

   $\begin{pmatrix} I_3 & H \\ 0 & G_b \end{pmatrix} := M \begin{pmatrix} Q_1 & \tilde{G} \\ \Gamma & 0 \end{pmatrix}$ .

The computational cost of the Generalized Schur Algorithm and the Refinement
Step as in Algorithm 1 is $13mn + 24.5n^2$ flops [10].
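
To put this count in perspective, it can be compared with the leading-order cost of a Householder QR factorization of an m × n matrix, $2mn^2 - 2n^3/3$ flops. That second formula is the standard textbook estimate and is an assumption of this sketch, not a figure taken from the paper; the function names are hypothetical.

```python
def flops_fast(m, n):
    """Cost quoted above for the Generalized Schur Algorithm plus refinement."""
    return 13 * m * n + 24.5 * n * n

def flops_householder(m, n):
    """Leading-order cost of Householder QR on an m x n matrix (assumed)."""
    return 2 * m * n * n - 2 * n ** 3 / 3

for m, n in [(1000, 100), (1000, 1000), (2000, 2000)]:
    ratio = flops_householder(m, n) / flops_fast(m, n)
    print(m, n, round(ratio, 1))
```

The ratio grows with n, as expected when a cost that is quadratic in the dimensions replaces a cubic one. Measured running times need not follow the flop ratio, since the two methods reach different flop rates.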



4 The Parallel Refinement Step 

The correction step proposed in [10] adds some cost to the global Toeplitz LS
solver, and we can minimize its impact with the use of parallelism. The keys
to obtaining a good parallel version of the correction process are an adequate
distribution of the data and the construction of new appropriate parallel
routines to solve some steps of the algorithm.

A cyclic row block distribution with a block size v (v must be greater than or
equal to 3) has been used (Fig. 3). The proposed workspace is divided into two
distributed sub-workspaces: the generator, with entries denoted by G, and the
rest, whose entries q and Γ denote the entries of the matrices $Q_1$ and Γ (11)
respectively before applying transformation M (12). The entries G denote the
triangular factor obtained by the Parallel Generalized Schur Algorithm that
will be updated later in order to obtain the factor defined in (6). The rest
of the entries are distributed matrices used for auxiliary purposes as
described in Algorithm 2. In addition, a local workspace called F of size
(m+1) × 3 is used by each processor.

Algorithm 2 (Parallel Refinement Step).

Given the triangular factor $G_t^T$ computed on workspace G (Fig. 3) by the
Parallel Generalized Schur Algorithm, and given a matrix $W_1$ (9) replicated
on the local workspace F in all processors, this algorithm obtains the refined
triangular factor $G_b^T$ (5) in the entries $G_b$ of the workspace shown in
Fig. 4.

1. Update the distributed factor G by triangularizing the matrix
   $(G\ A_1\ A_2)$, where columns $A_1$ and $A_2$ (first two columns of
   workspace A) correspond to $\omega_t$ and $f_r$ respectively. Use parallel
   Givens rotations in order to form the triangular factor $\tilde{G}^T$ in (6).

2. Perform the following steps:

   (a) Compute $A := X^T F$. All processors know matrix X because it is not
       explicitly formed, and all have matrix $W_1$ replicated on their local
       workspace F. Each processor can calculate its local blocks of the
       distributed matrix A without communications.
Fig. 3. Parallel workspace used in the refinement step distributed over a
3 × 1 processor grid. The block size is 4. Matrices are stored in memory in
transposed form.



   (b) Solve the triangular linear system $A := \tilde{G}^{-T} A$ in parallel.

   (c) Solve the triangular linear system $A := \tilde{G}^{-1} A$ in parallel.

   (d) Perform the operation $F := F - XA$. Each processor $P_k$ calculates
       locally $F_k$, k = 0, ..., p-1, and, by a global sum
       $F := \sum_{k=0}^{p-1} F_k$, all processors obtain F.

3. Save A into B.

4. Perform the following steps:

   (a) Perform the following sub-steps:

       i. Compute $A := X^T F$. Each processor can calculate its local blocks
          of the distributed matrix A without communications.
       ii. Solve the triangular linear system $A := \tilde{G}^{-T} A$ in
           parallel.
       iii. Update factor B, $B := B + A$.
       iv. Solve the triangular linear system $A := \tilde{G}^{-1} A$ in
           parallel.
       v. Perform the operation $F := F - XA$. Each processor $P_k$ calculates
          locally $F_k$, k = 0, ..., p-1, followed by a global sum
          $F := \sum_{k=0}^{p-1} F_k$. In this case, only the processor $P_0$
          will have the resulting factor F of the global sum.

   (b) If $P_k = P_0$, calculate the triangular factor Γ of the QR
       decomposition of F and copy it transposed on the entries denoted by Γ
       in Fig. 3.

5. Perform the following steps:

   (a) Redistribute factor B to the entries denoted by q in Fig. 3. These
       entries are owned by processor $P_0$ because in the distribution chosen
       v ≥ 3.

   (b) Set the entries A to zero and triangularize the workspace formed by
       entries q, Γ, G and A (the matrix defined in (12) in transposed form)
       by a product of parallel Givens rotations M. The result of this
       operation is the triangular factor $G_b^T$ in Fig. 4.

The parallel Refinement Step in Algorithm 2 involves several basic computa- 
tions that we summarize below: 
Fig. 4. Parallel workspace of the algorithm after the refinement step.



— Two matrix-matrix multiplications of the matrix $X^T$ (7) by the local
workspace F. These computations are performed in parallel without
communications because matrix X is not explicitly formed and the second matrix
is replicated in all processors.

— The solution of three triangular linear systems involving the distributed
matrix denoted in Fig. 3 with entries G and the distributed matrix denoted with
entries A. We have applied the corresponding PBLAS routines to perform these
operations in parallel.

— Two matrix-matrix multiplications involving the matrix X and the distributed
matrix A. Each processor computes a local addition and, by means of a global
sum, the result is distributed over all processors in the first case, or only
stored on processor $P_0$ in the second case.

— Several parallel Givens rotations appearing in steps 1 and 5b. These steps
have been performed by blocks in order to minimize the number of messages
needed to broadcast the Givens rotations.

The main difficulty of the previous algorithm is the large number of different
basic computations to perform and the number of different matrices and vectors
involved. We have distributed all data so as to optimize the different steps
and to minimize the communication cost. Standard routines from ScaLAPACK and
PBLAS have been used for matrix distributions and for the solution of
distributed triangular linear systems. On the other hand, we have implemented
several specific routines in order to perform the matrix-matrix
multiplications and the triangularization of workspaces with Givens rotations
as described above. A more detailed description of these routines can be found
in [2].

5 The Parallel Algorithm 

In this section we show a condensed version of the whole parallel algorithm.
Step 2 of Algorithm 3 corresponds to the Parallel Generalized Schur Algorithm
described in section 2, while step 3 corresponds to the parallel Refinement
Step described in section 4.

Algorithm 3 (Parallel Algorithm for the Toeplitz LS Problem).

Given a Toeplitz matrix $T \in \mathbb{R}^{m \times n}$ and an arbitrary vector
$b \in \mathbb{R}^m$, this algorithm computes in parallel the solution vector
$x \in \mathbb{R}^n$ of the LS Problem (1) in a stable way.

1. Compute values κ, ω and $g = (g_{11}\ g_r^T)$ (8) by a QR decomposition of
   the first two columns of [b T]. Save ω in the distributed workspace $v_1$
   and g in the distributed workspace $v_2$ (Fig. 3). Form the generator
   $\mathcal{G}$ of the displacement matrix $\nabla_Z$ (4) distributed over the
   workspace denoted by entries G in Fig. 3.

2. Compute the triangular factor $G_t \in \mathbb{R}^{(n-1)\times(n-1)}$ on to
   the distributed workspace G using the Parallel Generalized Schur Algorithm
   described in section 2.

3. Compute the triangular factor $G_b \in \mathbb{R}^{(n-1)\times(n-1)}$ on to
   the workspace (Fig. 4) applying the parallel Refinement Step described in
   section 4.

4. Using the scalar κ, the vectors ω and g stored on $v_1$ and $v_2$
   respectively, and the refined factor $G_b$, compute the operation described
   in (3) via a product of parallel Givens rotations J. Solve the resulting
   triangular linear system with matrix R in parallel (using PBLAS) in order
   to obtain the solution vector x of the LS problem (1).

6 Experimental Results 

First, we have tested our parallel algorithm by using the matrices proposed 
in [10], concluding that the parallel version preserves the stability properties of 
the sequential algorithm. 

In this section we show the experimental results obtained with our parallel
algorithm using a cluster of 32 personal computers. Each node contains a
Pentium II microprocessor, and all of them are connected through a Myrinet
network. This environment provides good ratios in terms of computation and
communication speed and also in terms of price and performance. Besides, this
type of multicomputer can be easily upgraded and it allows the use of standard
libraries like MPI and ScaLAPACK over the Linux operating system.

In Table 1 we compare the results of our parallel algorithm with PDGELS, a
parallel routine for solving the general least squares problem included in the
ScaLAPACK library. PDGELS attains higher parallel efficiencies, but its
running time is longer. This happens because the algorithm included in
ScaLAPACK does not take into account the structure of Toeplitz matrices.

Table 1. Comparison between PDGELS routine and Algorithm 3. The table shows
the time in seconds and the efficiency for 1, 2, 4, 8 and 16 processors. The
best block size and logical grid have been used for PDGELS routine.

Algorithm 3
m x n          1        2              4              8              16
1000 x 100     0.143    0.105 (68%)    0.086 (42%)    0.075 (24%)    0.083 (11%)
1000 x 500     0.868    0.674 (64%)    0.463 (47%)    0.407 (26%)    0.341 (16%)
1000 x 1000    2.144    1.715 (62%)    1.173 (46%)    0.988 (27%)    1.016 (13%)

PDGELS routine from ScaLAPACK
m x n          1        2              4              8              16
1000 x 100     0.272    0.162 (84%)    0.112 (61%)    0.104 (33%)    0.104 (16%)
1000 x 500     3.121    1.838 (85%)    1.263 (62%)    0.923 (42%)    0.778 (25%)
1000 x 1000    8.010    5.272 (76%)    3.725 (54%)    2.301 (43%)    1.650 (30%)

In Table 2 we show the running time and efficiency of Algorithm 3 with
matrices of different sizes, using different numbers of processors. All the
results are obtained using the best block size (v) in each case. Good
efficiencies are obtained with few processors, but the efficiency decreases
when the number of processors increases. The results improve when we increase
the size of the matrices and when m ≫ n (Fig. 5). This behaviour of the
parallel algorithm is due, mainly, to the low computational cost of the
sequential algorithm. We are trying to parallelize an algorithm that exploits
the special structure of the Toeplitz matrices and that reduces the cost from
$O(mn^2)$ to $O(mn)$. It is widely known that in a distributed environment, a
very important factor to obtain good performance is to reduce the
communication cost or, at least, to increase the ratio between computational
and communication costs. In the case of a sequential algorithm with such a
small computational cost, any communication introduced in the parallel
implementation has an enormous impact on the performance.
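
The efficiencies reported in the tables can be related to the running times with a few lines of Python. The sketch assumes the usual definition, efficiency = $T_1 / (p\,T_p)$, which the text does not state explicitly; the function name is hypothetical, and the times are those reported for Algorithm 3 on the 1000 × 100 problem.

```python
def efficiency(t_seq, t_par, procs):
    """Parallel efficiency in percent: speedup T_1 / T_p divided by p."""
    return 100.0 * t_seq / (procs * t_par)

# Times in seconds for Algorithm 3 on the 1000 x 100 problem (Table 1).
times = {1: 0.143, 2: 0.105, 4: 0.086, 8: 0.075, 16: 0.083}
eff = {p: round(efficiency(times[1], t, p)) for p, t in times.items() if p > 1}
print(eff)
```

Under this definition the rounded values agree with the percentages listed for that row. For some other rows the recomputed figure can differ by one point, presumably because the published times are themselves rounded.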



Table 2. Parallel results in time (seconds) and efficiency for several Toeplitz
matrices of different number of rows and columns.

m x n          1        2              4              8              16
1000 x 100     0.143    0.105 (68%)    0.086 (42%)    0.075 (24%)    0.083 (11%)
1000 x 500     0.868    0.674 (64%)    0.463 (47%)    0.407 (26%)    0.341 (16%)
1000 x 1000    2.144    1.715 (62%)    1.173 (46%)    0.988 (27%)    1.016 (13%)
2000 x 200     0.600    0.383 (78%)    0.249 (60%)    0.196 (38%)    0.178 (21%)
2000 x 1000    3.630    2.555 (71%)    1.580 (57%)    1.223 (37%)    1.068 (21%)
2000 x 2000    9.225    6.893 (67%)    4.151 (55%)    3.124 (37%)    2.922 (20%)


We also analyze the impact of the introduction of the Refinement Step in the 
algorithm. In Table 3 we indicate separately the cost of the Generalized Schur 
Algorithm and the cost of the Refinement Step. We can see that the sequential 
cost of the Refinement Step is much higher than the cost of the Generalized 
Schur Algorithm. Therefore, increasing speed-ups can be expected from a good 
parallelization of step 3 in Algorithm 3. Note also that the parallel speedup of 
the Generalized Schur Algorithm is strongly inhibited by the communications
involved in this part of the algorithm. Indeed, the necessity to broadcast the 
transformation factors at each iteration and the displacement of the first col- 
umn of the generator involves a large communication cost. This cost is even 
greater if compared with the very small computational cost of applying the 
transformations. 



Table 3. Time in seconds and efficiency of Steps 2 and 3 in Algorithm 3.

                                    Number of processors
m × n         Step     1       2              4              8              16
                       time    time    eff.   time    eff.   time    eff.   time    eff.
2000 × 200    Schur    0.016   0.028   29%    0.030   13%    0.036    6%    0.043    2%
              Refi.    0.522   0.331   79%    0.199   65%    0.147   44%    0.123   27%
2000 × 1000   Schur    0.401   0.296   68%    0.249   40%    0.245   20%    0.257   10%
              Refi.    2.915   1.817   80%    1.050   69%    0.718   51%    0.579   31%
2000 × 2000   Schur    1.583   0.972   81%    0.694   57%    0.602   33%    0.579   17%
              Refi.    6.909   4.382   79%    2.542   68%    1.739   50%    1.370   32%



In order to fully exploit the parallel system, we have also analyzed the
influence of scaling the problem with the number of processors. Specifically, we
have used an isotemporal scale, increasing the size of the problem with the
number of processors in order to maintain the temporal cost of the parallel
algorithm. The cost of the problem is O(mn), so we have scaled both factors, m
and n, with the square root of the number of processors. The scaled speedup in
this case is given by

\[ \frac{p \, T_s(m, n)}{T_p(m', n')} \qquad (13) \]

where p is the number of processors, T_s(m, n) is the sequential time of the
algorithm with a matrix of size m × n and T_p(m', n') is the time of the
parallel algorithm using p processors and with a matrix of size m' × n'. For
example, if m = n = 400 with one processor, then m' = n' = 800 with four
processors and m' = n' = 1600 with 16 processors.
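
The isotemporal scaling rule and equation (13) can be written down as a small Python sketch. The function names are hypothetical and no measured timings are assumed; the only data used are the example sizes given in the text.

```python
import math


def scaled_size(m, n, p):
    """Scale both dimensions by sqrt(p), so that the O(mn) cost grows linearly with p."""
    s = math.sqrt(p)
    return round(m * s), round(n * s)


def scaled_speedup(p, t_seq, t_par_scaled):
    """Equation (13): p * Ts(m, n) / Tp(m', n')."""
    return p * t_seq / t_par_scaled


for p in (1, 4, 16):
    print(p, scaled_size(400, 400, p))
```

In the ideal case, where the scaled problem on p processors takes exactly as long as the original problem on one processor, `scaled_speedup` returns p.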

Fig. 5 shows the scaled speedup for matrices with different ratios between m
and n. Specifically, we show the results for square matrices (m = n), for
matrices in which m = 2n, and for very rectangular matrices (m = 10n). In all
cases, we start with a matrix in which n = 400 in the sequential case. The
results are not close to the optimum, but we obtain scaled speedups greater
than 10 with 32 processors. Given the specific characteristics of the
parallelized algorithm, this can be taken as a good result.

Fig. 5 also shows that we obtain better results with rectangular matrices than
with square matrices. This behaviour shows that the computation cost of the
parallel algorithm depends on both factors, m and n, while the communication
cost increases mainly with n. Indeed, during the first phase of the algorithm
the communications depend only on n.

[Plot: scaled speedup (vertical axis) versus number of processors (horizontal
axis).]

Fig. 5. Scaled speedup using an isotemporal scale (n = 400 in the sequential
case).



7 Conclusions 

In this paper we have presented a new parallel stable algorithm for solving the
least squares problem with Toeplitz matrices. This algorithm exploits the special
structure of this class of matrices in order to reduce the cost and is mainly based
on the parallelization of the method presented in [10].

We have parallelized the two main phases of the method: the Generalized
Schur Algorithm and the refinement process that yields a more accurate result.
The parallel algorithm maintains the stability properties of the sequential
method and, therefore, offers an accuracy similar to that of the QR method based
on Householder transformations. The parallel algorithm has been developed using
a standard environment, which makes the code portable across different parallel
environments. We have used the ScaLAPACK library based on the MPI
message-passing library.

The specific characteristics of the algorithm do not allow us to exploit the
two-dimensional parallel model of the ScaLAPACK library. The second step of
Algorithm 3 is based on the transformation of the generator of the Toeplitz
matrix, which has n - 1 rows but only six columns. Therefore, we have had to use
a one-dimensional grid in order to handle this phase of the algorithm
efficiently. During the Refinement Step we have tried to reduce the communication
cost by using an appropriate workspace, and we have combined ScaLAPACK routines
with other routines designed specifically for the different stages of the
refinement. In order to apply the Givens rotations it is suitable to use the same
logical grid of p × 1 processors.

An experimental analysis has been carried out on a cluster of personal
computers based on a high-performance interconnection network. The results show
good efficiencies with a small number of processors, but they are not as good
when we use a large number of processors. The main reason for this behaviour is
the very small computational cost of the algorithm that we have parallelized.
However, Algorithm 3 is faster than the corresponding standard routine in
ScaLAPACK.



References 

[1] Pedro Alonso, José M. Badía, and Antonio M. Vidal. Un algoritmo paralelo
estable para la resolución de sistemas de ecuaciones Toeplitz no simétricos. In
Actas del VI Congreso de Matemática Aplicada (CMA), Las Palmas de Gran
Canaria, volume II, pages 847-854, 1999.

[2] Pedro Alonso, José M. Badía, and Antonio M. Vidal. Algoritmos paralelos para la
resolución de sistemas lineales y del problema de mínimos cuadrados para matrices
Toeplitz no simétricas. Technical Report II-DSIC-2/2000, Universidad Politécnica
de Valencia, January 2000.

[3] E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz,
A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov, and D. Sorensen. LAPACK
Users' Guide. SIAM, Philadelphia, second edition, 1995.

[4] L. S. Blackford, J. Choi, and A. Cleary. ScaLAPACK Users' Guide. SIAM, 1997.

[5] Richard P. Brent. Stability of fast algorithms for structured linear systems. In
T. Kailath and A. H. Sayed, editors, Fast Reliable Algorithms for Matrices with
Structure, pages 103-116. SIAM, 1999.

[6] S. Chandrasekaran and Ali H. Sayed. A fast stable solver for nonsymmetric
Toeplitz and quasi-Toeplitz systems of linear equations. SIAM Journal on Matrix
Analysis and Applications, 19(1):107-139, January 1998.

[7] J. Chun, T. Kailath, and H. Lev-Ari. Fast parallel algorithms for QR and
triangular factorization. SIAM Journal on Scientific and Statistical Computing,
8(6):899-913, November 1987.

[8] Gene H. Golub and Charles F. Van Loan. Matrix Computations, volume 3 of
Johns Hopkins Series in the Mathematical Sciences. The Johns Hopkins University
Press, Baltimore, MD, USA, second edition, 1989.

[9] Thomas Kailath and Ali H. Sayed. Displacement structure: Theory and
applications. SIAM Review, 37(3):297-386, September 1995.

[10] Haesun Park and Lars Eldén. Schur-type methods for solving least squares
problems with Toeplitz structure. SIAM Journal on Scientific Computing,
22(2):406-430, July 2000.

[11] J. Schur. Über Potenzreihen, die im Innern des Einheitskreises beschränkt sind.
Journal für die reine und angewandte Mathematik, 147:205-232, 1917.

[12] M. Snir, S. W. Otto, S. Huss-Lederman, D. W. Walker, and J. Dongarra. MPI:
The Complete Reference. MIT Press, MA, USA, 1996.

[13] D. R. Sweet. Fast Toeplitz orthogonalization. Numerische Mathematik,
43(1):1-21, 1984.




An Index Domain for Adaptive 
Multi-grid Methods* 



Andreas Schramm 



RWCP Parallel and Distributed Systems GMD Laboratory 
Kekulestr. 7, 12489 Berlin, Germany 
schramm@first.gmd.de



Abstract. It has been known for some time that groups as index do- 
mains of indexable container types provide a unified view for “geometric” 
(grids) and “hierarchic” (trees) spatial structures. This conceptual uni- 
fication is the starting point of further generalizations. 

In this paper we present a new kind of index domains that combine 
both kinds of structure in a single index domain. Together with the 
“structured-universe approach”, these new index domains constitute a 
framework for an expressive description of adaptive multi-grid discretiza- 
tions and algorithms. 

Keywords: Programming models, data parallelism, container types, 
structured-universe approach, multi-grid, indexable types, groups. 



1 Introduction: Infinite Index Domains and the 
“Structured-Universe Approach” 

As is well known, virtual memory allows for a dynamic extensibility of data 
structures like stacks and heaps under preservation of their logical contiguity in 
the address space. The memory-management unit (MMU) inserts an abstrac- 
tion layer which maps a finite number of finite substructures (the “pages” ) of a 
conceptually infinite address domain (ℕ₀) onto some physical representation.

The structured-universe approach is a high-level container type concept with 
a similar kind of abstraction as virtual memory [12]. Its data types, called “power 
types” , are indexable types with infinite index domains and a distinguished de- 
fault “zero value” for the element type (0.0 for REAL, etc.). 

By appropriate operands and data parallel operations, arbitrary elements of
power-type variables can be overwritten, finitely many at a time. Thus,
power-type variables always have finitely many non-zero elements (the black •'s
in Fig. 1); this property is somewhat reminiscent of infinite-dimensional vector
spaces. The state-changing operations can alter finite substructures indexed by
chunks of any shape and size and at any location in the index domain (in contrast
to the allocation of fixed pages). This allows for a convenient modeling of
dynamic and irregular data structures under preservation of their logical
contiguity and neighbourhood structure in their global problem-specific index
domain, and leads to compact programs that are close to the problem's underlying
mathematical formulae.

[Figure: elements of type A with zero value “0” over a structured infinite index
domain; black •'s mark an arbitrary (finite) support.]

Fig. 1. The “structured-universe approach”: Indexable types with infinite index
domains and a default “zero” value for the element type. For variables, the
supports are restricted to be finite

* This work was supported by the Real World Computing Partnership (RWCP), Japan.

Both burden and freedom of setting up the internal technical representa- 
tion for this “shape-and-granularity polymorphism” are then transferred to the 
underlying abstract machine, which has to act as something like an “Index Do- 
main Management Unit” (in analogy to the MMU). The structural information 
necessary to do so efficiently on a distributed-memory machine (especially lo- 
cality information) is contained — partially statically and partially dynamically, 
depending on the nature of the application — in the index domains, the data and 
communication patterns, and the operations with them. 

An approach that preserves the problem-specific structure of index domains is
only worth as much as the structure that these domains actually contain.
Therefore the structured-universe approach is equipped with a variety of
problem-specific index domains, which are infinite and, as will be seen later,
more general than usual in other ways as well. A non-obvious example of these
index domains is the topic of this paper.

Overview: In Sect. 2 and 3, we analyze the formal properties of index do- 
mains in general and for multi-grid data in particular. In Sect. 4 through 6, we 
sketch a small sample problem, an algorithm, and program text, and summa- 
rize the relations between the respective abstract properties of the application 
and the programming model employed. In Sect. 7 and 8, we make comparisons, 
summarize, and draw conclusions. 

2 What Accounts for the “Right” Index Domain, 
and Why? 

The index domains effect a problem-specific geometrization of container data. 
As for the “right” index domains, for instance we intuitively feel that a two- 
dimensional grid should be modeled by a two-dimensional array, and that its 
mapping onto a one-dimensional address space should be done by the compiler. 
Analogous considerations hold for higher dimensions and, as we shall see, can also 
be applied to structures that are usually not perceived as indexable ones, such as 
trees. A formalization of this intuition leads to the following criteria of
“naturalness” of index domains:

1. Nearest-neighbour relations must correspond to index-arithmetically small
distances. Simultaneous nearest-neighbour communications in parallel
algorithms must be “parallel shifts” of data within an index domain.

2. Multiple non-elemental substructures of a power-type entity must be index- 
able descriptively by multiple congruent subsets of the index domain (e.g., 
the rows in a matrix). 

(Multiple non-elemental substructures occur for instance in routine liftings
that express nested parallelism. Multiple substructures of congruent shapes
correspond to what other container-type concepts express by multiple
substructures of the same type [10].)

If the structured-universe approach is used with the right index domains,
irregularity and dynamicity of spatial structures typically go into the supports
of the data (the black •'s in Fig. 1), while the communication patterns and data
decomposition schemes retain their regularity in the infinite index domains.
Pointers and indirect indexing (the structureless “spaghetti” implementation
techniques in this field) need to be employed less frequently.



Groups as Index Domains. It has been known for some time that finitely 
generated groups constitute a unified index domain concept for grids and trees 
in the sense explained above [5,10]. The following correspondences hold between 
spatial structures and the (infinite) groups into which they are embedded as 
substructures: 



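\[
\begin{array}{rcl}
\text{grids} & \subset & \text{free Abelian groups} \\
             & \vdots  & \\
\text{trees} & \subset & \text{free groups}
\end{array}
\qquad \updownarrow\ \text{degree of commutativity}
\qquad (1)
\]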



Now, with groups as index domains, the parlance changes a bit:

1. The role of describing “small distances”, formerly played by (tuples of) small
integers, is now played by (sums of few of) the generators of the group.

2. “Parallel shifts” within an index domain, and congruence of subsets, are
defined in terms of the respective group operation, here generically written “⊕”.

For integer grids, these terms merely rephrase the intuitive understanding. The
same holds for trees and free groups: The neighbourhoods characterized by the
generators of the group are those between parent and child, and communications
between parents and their respective children are described by parallel shifts
of data within the index domain by a small index-arithmetic distance. The
non-commutativity of the corresponding groups reflects the special geometry of
trees, which after all is different from that of grids.

In short, groups as index domains constitute a unification of container struc- 
tures that are commonly regarded as quite different. And even better, this 
unification is the starting point for a further generalization, which we now begin 
to introduce. 
Fig. 2. Section of an example group that models the geometry of multi-grid
discretizations. Observe the close interaction of the grid-like and the tree-like
spatial structures, as formalized by relation (2)



Degree of Commutativity. We begin with a remark about commutativity of 
groups. There are several ways to attribute a gradated “degree of commutativity” 
to groups, as opposed to a mere Abelian-or-not classification. In all of these ways,
Abelian groups and free groups mark the opposite extreme cases. So it appears
to be natural to investigate whether “intermediate” groups between the extremes 
serve some purpose. This is indeed the case, and one of the possibilities to fill in 
the ellipsis in (1) is the kind of groups we are going to present in the next section. 
It is not much of a surprise that this kind of groups exhibits an amalgamation 
of both grid-like and tree-like properties in the same index domain. 



3 The Index Domain for Multi-grid Data

Multi-level methods (methods that employ multi-level discretizations) occur in 
various fields. They are renowned for their efficiency and, in the case of the dy- 
namically adaptive variant on distributed-memory machines, notorious for their 
difficulty of programming. They are treated in more depth e.g. in [2,6]; here we 
just mention that their characteristic property is the combined use of discretiza- 
tions of the same physical space at different levels of resolution. The algorithms 
typically employ both intra-level and inter-level communications. Here we con- 
fine ourselves to geometric multi-grid methods. 

Our starting point is the observation that the spatial resolution of the
discretization (usually) doubles in the transition from one level to the next
one. For an illustration we assume a two-dimensional integer grid (index domain
ℤ²) and use the term “one level down” for the transition to the next level with
doubled resolution. Then, in order to cover a certain distance x at one level
farther down, we have to take twice as many steps. (E.g., first going east one
step and then going down is the same as going down first and then going east two
steps; see Fig. 2.)

This observation can very well be formalized as a relation within a non- 
Abelian group: 



\[ x \oplus \mathit{down} = \mathit{down} \oplus x \oplus x
   \quad \text{for all } x \in \mathbb{Z}^2 . \qquad (2) \]
So we construct the index domain for a multi-level discretization of a
two-dimensional domain as follows: The group ℤ² is extended by an additional
generator, called “down”, and made subject to the relation (2). Figure 2 shows
a section of the resulting index domain, which clearly exhibits the desired
multi-level nature.
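
One way to make relation (2) concrete is the following minimal Python sketch. It is not part of the Universe system. It assumes that every group element is stored in the normal form down^level ⊕ (x, y), with exact fractions as coordinates so that inverses of down remain representable; the class name `MGIndex` and the use of `+` for ⊕ are illustrative choices.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class MGIndex:
    """Group element in the assumed normal form down^level (+) (x, y)."""
    level: int
    x: Fraction
    y: Fraction

    def __add__(self, other):
        # Moving our grid offset past each `down` of `other` doubles it (relation (2)).
        f = Fraction(2) ** other.level
        return MGIndex(self.level + other.level,
                       self.x * f + other.x, self.y * f + other.y)

    def __neg__(self):
        # Inverse element: a + (-a) is the identity.
        f = Fraction(2) ** (-self.level)
        return MGIndex(-self.level, -self.x * f, -self.y * f)


ZERO = MGIndex(0, Fraction(0), Fraction(0))
EAST = MGIndex(0, Fraction(1), Fraction(0))
DOWN = MGIndex(1, Fraction(0), Fraction(0))

# Relation (2): one step east, then down == down, then two steps east.
print(EAST + DOWN == DOWN + EAST + EAST)
# The group is not Abelian: the order of east and down matters.
print(EAST + DOWN == DOWN + EAST)
```

In this representation the `level` component plays the tree-like role and the offset plays the grid-like role, which mirrors the mixed geometry shown in Fig. 2.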

We compare the two above-mentioned communication relations in multi-level
methods with the geometry represented by this group¹: Intra-level
nearest-neighbour communications (e.g., in the computation of point-wise
residuals) work just as in integer grids. Inter-level communications (e.g., in
the computation of prolongation and restriction operators) can be expressed in
the same way by data shifts by small distances, using down or its inverse,
respectively.

In summary, the presented index domain is capable of formalizing both kinds 
of locality of originally different nature. Hence, both kinds of (translation-invari- 
ant) communication can be expressed as convolutions by appropriate stencils. 
The only difference is that the convolution takes place in the new kind of index
domain and is defined by means of the group operation “⊕”.

Groups that model the geometry of anisotropic (nonstandard) coarsenings 
can be constructed similarly, but this is not carried out here. 



4 A Sample Problem and Its Numerical Method 

4.1 The Problem 

The motivations for multi-level approaches are (i) faster convergence, and (ii) 
adaptive refinements, for a reconciliation of accuracy and efficiency. 

As an example for both the structured-universe approach and the new kind 
of index domains, we present an adaptive multi-grid application. We consider a 
simple boundary-value problem. We assume as given 

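\[ \text{a domain } \Omega = (a, b) \times (a, b) \subset \mathbb{R}^2 , \]
\[ \text{a function } f : \Omega \to \mathbb{R} , \]
\[ \text{a boundary function } F : \partial\Omega \to \mathbb{R} , \]

and seek as solution

\[ u : \Omega \to \mathbb{R} \]
\[ \text{with } Lu = -\Delta u = f \text{ on } \Omega \qquad (3) \]
\[ \text{and } u|_{\partial\Omega} = F . \]

We assume that the right-hand side f possesses a singularity somewhere on the
boundary ∂Ω, so that the problem calls for adaptive refinement.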

¹ Recall that the purpose of the index domains in the structured-universe approach
is to express the “natural” problem-specific neighbourhoods and congruences within
container data, as explained in Section 2.

Fig. 3. Data fields and dependences for the modified full multi-grid (FMG)
scheme. Larger level numbers refer to finer grids. With spatial adaptivity, some
finer levels may represent only subsets of the problem domain



4.2 The Numerical Method 

Data Fields and Basic Operations. For an initial coarse level 0 and for 
an indeterminate number of successively finer levels, the following infinite-grid 
quantities with finite supports are maintained: “interpolated solution”, “solution 
corrector” , “residual” , and “right-hand-side perturbation” ; these names may ap- 
pear abbreviated in equations and program texts. Figure 3 sketches the data 
structure and the data flows therein. 

The residual follows the other quantities so that the following variant of (3) is 
fulfilled (in its respective discretized form): 

\[ L(\textit{interpol.-solution} + \textit{soln.-corrector})
   = f + \textit{RHS-perturbation} + \textit{residual} \qquad (4) \]

The solution algorithm will be constructed from the following four basic 
operations (larger level numbers correspond to finer resolutions): 

1. Initialization at level 0: At the coarsest level, the (small) system of
equations is solved, and the solution is stored into the field
interpolated-solution.

2. Interpolation from level k to k+1: A suitable interpolation operator is
applied to the sum interpolated-solution + solution-corrector of level k, and
the result is stored into the field interpolated-solution of level k+1.

3. Smoothing at a level k: A smoothing method is applied to the residual at
level k, and the resulting correction values are added to the already existing
solution-corrector. (The residual decreases accordingly.)

4. Restriction (residual coarsening) from level k+1 to k: A restriction operator
is applied to the residual at level k+1, and the result is stored into the field
RHS-perturbation of level k.
Organization of the Basic Operations. In the multi-grid terminology, the
method presented here is a full multi-grid (FMG) scheme with V(1,1)-cycles. It
is organized as follows: After the initialization at level 0, the process descends 
(i.e., interpolation followed by smoothing) to a certain maximal depth and then 
ascends (i.e., restriction followed by smoothing) back to level 0. These descents 
and ascents are continued with a successively increasing maximal depth until no 
further refinement is necessary. 

For simplicity of presentation, the algorithm presented here deviates somewhat
from the conventional ones in three respects. First, we neglect the
fact that usually different interpolation operators are employed in the FMG
refinements and the multi-grid cycles. Second, the coarse-grid corrections are 
calculated for the perturbed original equation (this is the full approximation 
scheme used for non-linear equations), and not from the pure defect equation. 
Third, the coarse-grid corrections of refined (smaller) subgrids nevertheless take 
place in the larger subregions pertaining to the coarser grids. This appears to 
be more intuitive, as even a residual with a spatially limited support may very 
well lead to a global correction of the solution. 

Spatial adaptivity consists in the technique that increasingly finer resolutions 
(with larger computational effort) are applied only to increasingly smaller subre- 
gions of the problem domain, under control of some refinement criterion (a local 
discretization-error estimator). These subregions turn out to be the
neighbourhoods of the singularity of the right-hand side f of (3). (We assume
that f possesses only one singularity, so that the latter can be enclosed at
each level by a single rectangular subdomain.)

FMG, if used with a sufficiently good interpolation operator, has the property 
that for each level of refinement, an accuracy up to the corresponding discretiza- 
tion error of that level is achieved already after a single multi-grid cycle. We 
exploit this property and assume convergence to have occurred too when no 
further local refinement is required. 



5 The Program 

5.1 Prerequisites and Basic Program Patterns 

The program will be presented in an experimental linguistic concretization of 
Universe [11] on top of Oberon-2 [9]. Keywords and the predefined identifiers
of the host language are in all-caps. For space considerations, only very
brief explanations are given here.

An operation pattern that will occur frequently in the program text is the 
following one: 

power-type-variable[subdomain] := power-type-value $$ power-type-value;   (5)

where one operand of the infix operator “$$” (convolution) typically identifies a 
static communication pattern (a stencil). 
In such statements, some elements of a power-type variable are overwritten,
viz. those elements that are indexed by the subdomain in index position, the
“selection mask”. In a writing access like here, masking of a power-type variable
means that only the selected elements are overwritten. Masking of a power-type
value means that the non-selected elements are replaced by zero in the result.
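
The two masking rules can be imitated with a sparse dictionary, as in the following Python sketch. This is only an analogy under simplifying assumptions: the subdomain is a finite Python set, the element type is float, and the names `PowerValue`, `mask_value` and `masked_assign` are hypothetical, not Universe constructs.

```python
class PowerValue(dict):
    """Sparse container: every index that is not stored holds the zero value."""

    def __missing__(self, index):
        return 0.0


def mask_value(value, subdomain):
    """Masking a value: non-selected elements are replaced by zero."""
    return PowerValue({i: v for i, v in value.items() if i in subdomain})


def masked_assign(variable, subdomain, value):
    """Masking a variable in a write: only the selected elements are overwritten."""
    for i in subdomain:
        v = value[i]
        if v == 0.0:
            variable.pop(i, None)   # storing zero means dropping the entry
        else:
            variable[i] = v


var = PowerValue({(0, 0): 1.0, (5, 5): 2.0})
rhs = PowerValue({(1, 0): 3.0, (5, 5): 9.0})
masked_assign(var, {(0, 0), (1, 0)}, rhs)
print(dict(var))
```

After the assignment, index (0, 0) has been overwritten with zero, index (1, 0) holds 3.0, and index (5, 5) keeps its old value because it lies outside the selection mask.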

The subdomain expressions in the program texts below might appear quite 
complicated at first sight, but they resemble the conventional mathematical no- 
tations of integer intervals, element-wise sums of sets, Cartesian products, etc. 

The infix expression on the right-hand side of the assignment is a (discrete)
convolution product (a shortcut for “$ * REDUCE BY + $” if the element types of
the operands are numbers). It yields a result with the same index domain as
the operands, and for all non-zero elements x_i of the first operand and y_j of
the second operand, the products x_i * y_j are accumulated into the element
z_{i⊕j} of the result z.
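
The accumulation rule can be sketched in Python for sparse operands stored as dictionaries. The sketch assumes the plane grid ℤ × ℤ with component-wise addition as the group operation; the function name `convolve` and the `op` parameter are illustrative. The five-point stencil used in the example is the one that appears as Lstencil in Sect. 5.3.

```python
from collections import defaultdict


def add_plane(i, j):
    """Group operation of the plane grid: component-wise integer addition."""
    return (i[0] + j[0], i[1] + j[1])


def convolve(x, y, op=add_plane):
    """Accumulate x_i * y_j into z at index op(i, j) for all stored elements."""
    z = defaultdict(float)
    for i, xi in x.items():
        for j, yj in y.items():
            z[op(i, j)] += xi * yj
    return {k: v for k, v in z.items() if v != 0.0}


L_STENCIL = {(0, 0): 4.0, (0, 1): -1.0, (1, 0): -1.0, (0, -1): -1.0, (-1, 0): -1.0}

# A single unit value at (3, 3) is spread out according to the stencil.
print(convolve({(3, 3): 1.0}, L_STENCIL))
# A one-element stencil acts as a parallel shift of the data.
print(convolve({(3, 3): 2.0, (7, 1): 5.0}, {(1, 0): 1.0}))
```

Passing a different `op` is how the same routine could, in principle, be used with a non-Abelian index domain; only non-zero elements are ever visited, which matches the finite-support property of power-type values.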

Convolutions are the method of choice for a (non-redundant) expression of 
translation-invariant communications (data movements) within an index do- 
main. In all usages here, one of the operands is a static pattern (a stencil), 
which represents the discretization of the underlying linear operator. 

A sensible implementation will compute only those elements of the right- 
hand side that are actually used and not masked out in the assignment. In order 
to facilitate this, power-type products (e.g., the convolution) and also implicit 
liftings of scalar pure functions and operators have lazy semantics in Universe. 



5.2 Global Declarations 

First, two index domains are declared, viz. the two-dimensional infinite integer
grid and the index domain of Sect. 3.

The INDEXCOUNTER declaration declares two symbolic power-type constants
Xcoord and Ycoord with index domain ℤ × ℤ and element type ℤ. These
constants provide the “canonic” x and y coordinates of the integer grid; after
multiplication by the appropriate mesh size they will be used to parametrize the
parallel invocations of the right-hand side f and of the boundary condition F.
For every index point that can be written as a sum of (1, 0) and (0, 1) and their
inverses, the respective associated symbolic counter indicates how many of the
generators are used to express that index point.

The variable values holds the data structure depicted in Fig. 3; for each level,
regions[level] holds the integral corner coordinates of the finite rectangular
subgrids that correspond to Ω or its refining subregions, respectively.

INDEXDOMAIN
  PlaneGrid = ℤ × ℤ;
  MultiGrid = EXT (PlaneGrid, Down, 2);   (* see Sect. 3 *)

INDEXCOUNTER Xcoord OF (1,0), Ycoord OF (0,1);

TYPE
  Point: RECORD sol, corr, resid, perturb: REAL END;
  RectRegion: RECORD xa, xz, ya, yz, num: INTEGER END;

VAR
  values: [MultiGrid] <Point>;
  regions: [^] <RectRegion>;



5.3 The Basic Operations 

The solution of the small system of equations at the coarsest level (Step 1 in 
Sect. 4.2) is often done with a direct solver, and is not shown here. 

Residual Evaluation. The residual is evaluated according to (4). Besides
point-wise real additions and subtractions, the computation consists of the
evaluation of the discretized form of L and of the right-hand side f. The former
is done by convolution with the stencil Lstencil, which is a power-type constant
with index domain ℤ × ℤ, given below as a cascaded conditional expression.
The (scalar) function f is invoked multiple times (“lifted”) with an explicitly
specified replication space appearing before it, and each instance of f accesses
the correspondingly indexed elements of the power-type arguments, which consist
of the symbolic integer coordinates Xcoord and Ycoord scaled by the mesh
size h.



CONST Lstencil =
  {(0,0)} => 4.0 :
  {(0,1), (1,0), (0,-1), (-1,0)} => -1.0;

  (*      -1      *)
  (*  -1   4  -1  *)
  (*      -1      *)

PROCEDURE compResidual (level, xa, xz, ya, yz: INTEGER);
  VAR h: REAL;
BEGIN
  h := 1.0 / (2**level);
  (* evaluation of residual according to (4): *)
  values[{level*down}⊕{xa+1..xz-1}×{ya+1..yz-1}].resid :=
    (values.sol + values.corr) $$ Lstencil/(h*h)
    - values.perturb
    - {level*down} $$ [{xa+1..xz-1}×{ya+1..yz-1}].f(Xcoord*h, Ycoord*h)
END compResidual;



Smoothing. Smoothing is done by red-black relaxation, which combines good
smoothing properties with good parallelism properties [16]. The term refers to
the colouring of a grid in a chequerboard pattern: First, all “red” points are
relaxed, which can be done in parallel, and then all “black” points, again in
parallel (observe the two subdomains RedGrid and BlackGrid in the following
program fragment). As usual, “relaxing a grid point” refers to the point-wise
error smoothing obtained by averaging that grid point with its neighbours
according to the stencil involved, taking into account the right-hand side at
the same coordinates.
SUBDOMAIN
  RedGrid = { * OF (2,0), (1,1) };    (* spans all even-parity points *)
  BlackGrid = {(0,1)} ⊕ RedGrid;      (* coset of RedGrid *)

PROCEDURE smooth (level: INTEGER);
  VAR xa, xz, ya, yz: INTEGER; h: REAL;
BEGIN
  xa := regions[level].xa; xz := regions[level].xz;
  ya := regions[level].ya; yz := regions[level].yz;
  h := 1.0 / (2**level);
  compResidual(level, xa, xz, ya, yz);
  values[{level*down}⊕RedGrid].corr :=
    values.corr + values.resid*h*h/4.0;
  compResidual(level, xa, xz, ya, yz);
  values[{level*down}⊕BlackGrid].corr :=
    values.corr + values.resid*h*h/4.0
END smooth;
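
For readers unfamiliar with red-black relaxation, the following Python sketch shows one sweep on a small dense grid. It is an independent illustration, not a transcription of the procedure above: it uses plain nested lists on a single level, zero boundary values, and the residual convention r = f - Lu, so that the correction r·h²/4 is added. That sign convention, the grid size and the function names are assumptions of the sketch.

```python
def residual(u, f, h):
    """r = f - L u for the five-point discretization of L = -Laplace (interior only)."""
    n = len(u)
    r = [[0.0] * n for _ in range(n)]
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            lu = (4.0 * u[i][j] - u[i - 1][j] - u[i + 1][j]
                  - u[i][j - 1] - u[i][j + 1]) / (h * h)
            r[i][j] = f[i][j] - lu
    return r


def red_black_sweep(u, f, h):
    """Relax all even-parity ("red") points, then all odd-parity ("black") points."""
    n = len(u)
    for parity in (0, 1):
        r = residual(u, f, h)       # recomputed before each colour
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                if (i + j) % 2 == parity:
                    u[i][j] += r[i][j] * h * h / 4.0


n, h = 5, 0.25
u = [[0.0] * n for _ in range(n)]   # zero boundary values and zero initial guess
f = [[1.0] * n for _ in range(n)]
red_black_sweep(u, f, h)
r = residual(u, f, h)
print(max(abs(r[i][j]) for i in range(1, n - 1)
          for j in range(1, n - 1) if (i + j) % 2 == 1))
```

Points of the same colour are never neighbours under the five-point stencil, so all points of one colour can be updated from the same residual field; this is what makes each half-sweep fully parallel. After the sweep the residual vanishes at the black points, which were relaxed last.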



Restriction. A customary and robust method for the restriction (coarsening) 
of residuals is “full weighting” [2,6]. Every point of the coarser grid gets assigned 
a weighted sum of several nearby points of the finer grid, and the weights are set 
up in such a way that all points in the finer grid — also the interleaving ones — 
have the same total sum of weights, i.e., the same “influence” on the coarser 
grid. 

CONST Restrictor =
  {-down} => 4.0/16.0 :
  {(0,1), (1,0), (0,-1), (-1,0)}⊕{-down} => 2.0/16.0 :
  {(1,1), (-1,1), (1,-1), (-1,-1)}⊕{-down} => 1.0/16.0;

PROCEDURE restrictResid (level: INTEGER);
BEGIN
  values[{level*down}⊕PlaneGrid].perturb :=
    values.resid $$ Restrictor;
END restrictResid;
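
The full-weighting rule can be checked with a small Python sketch that works on plain integer coordinates instead of the group index domain. It assumes that the coarse point (X, Y) coincides with the fine point (2X, 2Y); the dictionary layout and the function name are illustrative. Contributions that would land between coarse points are discarded, which corresponds to the selection mask on the left-hand side of the assignment above.

```python
# Full-weighting weights in sixteenths, keyed by the offset on the finer grid.
WEIGHTS = {(0, 0): 4,
           (0, 1): 2, (1, 0): 2, (0, -1): 2, (-1, 0): 2,
           (1, 1): 1, (-1, 1): 1, (1, -1): 1, (-1, -1): 1}


def full_weighting(fine):
    """Restrict a sparse fine-grid field {(x, y): value} to the coarser grid."""
    coarse = {}
    for (x, y), v in fine.items():
        for (dx, dy), w in WEIGHTS.items():
            tx, ty = x + dx, y + dy
            if tx % 2 == 0 and ty % 2 == 0:      # lands on a coarse-grid point
                key = (tx // 2, ty // 2)
                coarse[key] = coarse.get(key, 0.0) + v * w / 16.0
    return coarse


# Total influence of a single fine point, for the three kinds of fine points.
for point in [(0, 0), (1, 0), (1, 1)]:
    print(point, sum(full_weighting({point: 1.0}).values()))
```

Each kind of fine-grid point (coinciding with a coarse point, on a coarse-grid edge, or in the centre of a coarse cell) contributes the same total weight, which is the "same influence" property described in the text.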



The Remaining Steps. The remaining steps are explained only in passing. 

The interpolation is in principle a linear operator just like the restriction,
expressed by convolution with a stencil. However, two details have to be taken
into account: (i) on the boundary, the solution candidate should be computed
directly from the given boundary function F, and not by interpolation; (ii) cubic
interpolation, which is advisable in FMG for numerical reasons, requires “four
points in a row”, but near boundaries and corners these four points are not
available in a symmetric distribution, i.e., two on either side. Therefore,
different interpolation patterns have to be used near boundaries and corners.
Both of these case distinctions can be expressed by combining several assignments
like (5) with different subdomains.
The procedure SetupRectangle(level) determines the corner coordinates
of the domain rectangle of that level and stores them into the fields of the variable
regions[level]. This is done either in accordance with Ω = (a, b)² at those levels
where the entire domain Ω is to be considered, or according to a local error
estimator at those levels where adaptive refinement is to be employed. The field
regions[level].num is set to 0 iff the refinement area is empty.



5.4 The Main Program 

The main program implements the algorithm sketched in Subsect. 4.2. The al- 
gorithm begins at the coarsest level 0 and terminates at some finer level, viz. 
when the refinement criterion states that no further refinement is necessary. 

VAR level, depth, maxdepth, i: INTEGER;

SetupRectangle(0);

(* initial solution at level 0 (basic operation #1). *)

(* FMG multi-grid iteration: *)
maxdepth := 1;

LOOP

  (* descend down to level maxdepth: *)
  level := 0;

  REPEAT INC(level);

    interpolate(level); smooth(level)

  UNTIL level = maxdepth;
  interpolate(level+1);

  SetupRectangle(level+1);

  IF regions[level+1].num = 0 THEN EXIT (* from LOOP *) END;
  INC(maxdepth); (* for the next round *)

  (* ascend back to level 0: *)

  REPEAT DEC(level);

    restrictResid(level); smooth(level)

  UNTIL level = 0;

  FOR i := 1 TO ... DO smooth(0) END; (* a few more smoothings at level 0 *)
END (* LOOP *);



6 Observations 

We summarize and generalize the key observations about the relations between 
numeric applications and high-level programming models: 

— Spatial discretizations with arbitrary refinements are modeled naturally 
by countably infinite-dimensional vector spaces. Problem-specific operators 
(e.g., differential, prolongation, and interpolation operators) often are linear 
operators on these vector spaces. 

A programming model that models such applications in terms of vector 
spaces and linear operators can be expected to lead to compact programs. 
— The phenomenon of irregularity and dynamicity of spatial structures is
banned from the semantics (there are no “irregular” vector spaces) and
delegated to the system.

— If the canonic bases for these vector spaces are chosen adequately, then 
the problem-specific linear operators correspond to “simple” (e.g., nearest- 
neighbour and/or translation-invariant) communication patterns. 

A programming model that provides index domains that reflect geometry 
and locality properties of the application can be expected to lead to efficiency 
on distributed-memory parallel machines. 

— In the case of geometric multi-grid discretizations, the interaction of grid- 
like and tree-like geometries in the same index domain can be modeled by a 
group with appropriate equality relations. 



7 Comparisons 

Here we confine ourselves to a few other programming models that are related to 
the modeling of spatial structure of parallel applications. For a broader survey, 
see for instance [15]. 



Other Models with Indexable Types. A now “classic” programming model 
that elaborates on indexable types is Crystal [3] . Crystal is a higher-order func- 
tional language with data fields over generalized index domains, such as grids, 
trees, and hypercubes, and data-field and index-domain morphisms. The seman- 
tic complexity of Crystal is considerably higher than that of Universe. 

Groups as index domains have also been proposed for the programming model
8½ [5]. 8½ does identify the correspondence between generators of groups and
basic neighbourhood structures (Cayley graphs), but does not further pursue the
issue of non-Abelian groups and the identification of useful ones, and proposes
their representation by libraries.



More General Type Systems. There are other parallel programming models 
that employ inductive types or even more general settings for the modeling of 
spatial structure. As examples we mention the Bird-Meertens formalism [13] and 
NESL [1] for (join-) lists and Categorical Data Types [14] for polymorphic trees. 
A typical property of the category-theoretic approach is the inference of the 
container decompositions from the type constructors. There also is a category- 
theoretic understanding of shapes [8] (by which Universe simply understands 
patterns in structured infinite index domains). 

An abstract generic concept of capturing parallelism is that of algorith- 
mic skeletons [15]. Programs are composed from as few as possible predefined 
parametrizable building blocks (typically a small set of second-order functions), 
aiming at implementing parallelism as composition of pre-implemented internally 
parallel algorithmic fragments. Formally, the power-type products and procedure
liftings of Universe also constitute such a small set of second-order functions
that systematizes parallel access patterns for indexable container types. But in
contrast to “plain” skeleton concepts, Universe encodes the knowledge about
the problem geometry also (if not primarily) into index domains and the shapes
of subdomains and operands, perhaps a sort of “geometry skeletons”. The free
combination and interaction of these two concepts makes an implementation of
Universe a demanding task and weakens its simplicity as a skeleton concept.



More Technical Approaches. There are numerous approaches whose philos- 
ophy differs from the author’s in that they consist in implementation directives 
for some abstract machine, as opposed to the expression of structural informa- 
tion about the applications in the semantics. To this class belong data par- 
titioning/distribution algebras, languages, and systems, also High-Performance 
Fortran [7]. Another approach, a template concept for the modeling of irregular 
spatial structures, is given in [4]. 



8 Summary and Conclusion 

We have mentioned the structured-universe approach, a container-type concept 
based on structured infinite index domains. We have mentioned the known fact 
that groups as index domains are general enough to host grids as well as trees, 
and to formalize their different geometries under a unified scheme. We have ex- 
ploited this generality of groups as index domains further and have introduced a 
new kind of groups to host multi-grid algorithms. These groups reflect the multi- 
level nature in that grid-like and tree-like neighbourhoods interact in the same 
index domain. We have related this phenomenon to commutativity properties. 

This result sheds some more light on the little recognized versatility of (possibly
non-Abelian) groups as spatial domains. Originally conceived as a unification
of two different kinds of spatial structure, they generalize further to an “inter- 
polation” between these two. Together with the “structured-universe approach” 
— an abstraction scheme reminiscent of infinite-dimensional vector spaces over 
geometrically structured index domains — this new kind of index domains pro- 
vides an expressive formalization framework for adaptive multi-grid algorithms. 
Such formalizations are a prerequisite for the high-level programming of dis- 
tributed-memory machines by compact programs, and may constitute the input 
for an efficient automatic mapping onto such machines. 
Abstract. This work presents a parallelization of a recursive decoupling
method for solving tridiagonal linear systems on distributed memory
computers. We study the fill-in in the algorithm to optimize the execution of
the scalar algorithm and to perform the communications. Finally, we
evaluate the algorithm through specific tests on the Fujitsu AP3000.



1 Introduction 

In recent years considerable effort has been devoted to solving tridiagonal systems
(TS), a very important class of linear systems which appear when the finite
difference method is used to solve partial differential equations such
as the simple harmonic motion, Helmholtz, Poisson, Laplace and diffusion equations.
The finite difference method involves the discretization of the differential
equation and subsequently the solution of the tridiagonal systems thus generated.

There are many algorithms for solving TS, such as Gaussian elimination or
LU elimination, that have proved to be the most effective sequential algorithms
on serial computers. However, these algorithms cannot be directly adapted to
parallel computers. Much research has been undertaken on parallel algorithms for
solving TS. Hockney proposed the cyclic (odd-even) reduction (CR) algorithm in
1965. Although originally proposed as sequential, this algorithm can be adapted
to run on a wide range of parallel architectures [8,5]. In addition, new methods
for increasing the parallelism of the CR algorithm, such as PARACR [9] or the radix-p
CR algorithm [8], have been proposed. On the other hand, other well known
strategies have been adapted to obtain new parallel TS algorithms, such as the ones
proposed by Egecioglu et al. [6] (recursive doubling strategy), Lin and Cheng
[12] (prefix), and Wang and Mou [17] and Spaletta and Evans [16], which exploit
the parallelism of the divide-and-conquer strategy. Finally, a group of hybrid
algorithms has been proposed that are based on partitioning the system into
blocks of equations, using a local algorithm to reduce the subsystem in each block
and a global algorithm to solve the reduced system. In this group we include the
algorithms by Krechel, Plum and Stüben [10], Cox and Knisley [4], Müller and
Scheerer [15], Mattor, Williams and Hewett [14] and Amodio and Brugnano [2].
In [1] we have classified the above TS algorithms in terms of their data flows
and presented a unified parallelization on computers with mesh topology and
distributed memory.

In this paper, we consider the parallelization of the recursive decoupling
algorithm by Spaletta and Evans [16] on a distributed memory multiprocessor.
This algorithm behaves very well in terms of accuracy as the problem
size increases, and the partitioning process leads to independent systems. As
established in previous works, the memory allocation requirement is demanding [16]
and the execution times are not competitive with other partitioning methods [1].
In this paper we propose a technique to reduce the execution time of the scalar
algorithm, to minimize the memory requirements and to optimize the
communications in the parallel implementation. This technique exploits the sparsity of
the matrix obtained in the recursive fill-in process of the recursive decoupling
algorithm.

The rest of the work is organized as follows: in Section 2 we present the
recursive decoupling algorithm by Spaletta and Evans. The parallel algorithm is
presented in Section 3. Experimental results on the Fujitsu AP3000 multiprocessor
are shown in Section 4. Finally, in Section 5 we present the conclusions.



2 The Recursive Decoupling Algorithm 

We consider a set of N linear equations with N unknowns

$$A u = d, \qquad (1)$$

where A is a tridiagonal matrix of order $N \times N$ of the form

$$A = \begin{pmatrix}
b_0 & c_0 & & & & \\
a_1 & b_1 & c_1 & & & \\
 & a_2 & b_2 & c_2 & & \\
 & & \ddots & \ddots & \ddots & \\
 & & & a_{N-2} & b_{N-2} & c_{N-2} \\
 & & & & a_{N-1} & b_{N-1}
\end{pmatrix}, \quad \text{with } |b_i| > |a_i| + |c_i|, \ \forall i = 0, 1, \dots, N-1. \qquad (2)$$

With no loss of generality we will assume that the number of equations is a
power of two. We will denote $m = N/2 = 2^{n-1}$.

The recursive decoupling algorithm is based on the recursive calculation of
the inverse of matrix A by means of the Sherman-Morrison formula [7]. To this
end, we decompose the matrix A in (2) as follows:
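$$A = \begin{pmatrix} J_0 & & \\ & \ddots & \\ & & J_{m-1} \end{pmatrix} + \sum_{j=1}^{m-1} x^{(j)} y^{(j)T}, \qquad J_k = \begin{pmatrix} e_{2k} & c_{2k} \\ a_{2k+1} & e_{2k+1} \end{pmatrix}, \qquad (3)$$

where

$$e_0 = b_0, \qquad e_{N-1} = b_{N-1}, \qquad (4)$$

and

$$e_{2j-1} = b_{2j-1} - a_{2j}, \qquad e_{2j} = b_{2j} - c_{2j-1}, \qquad j = 1, \dots, m-1. \qquad (5)$$

In expression (3), the column vectors $x^{(j)}$ and $y^{(j)}$ have only
two non-zero elements, at the positions $2j-1$ and $2j$, that is

$$x^{(j)} = (0, \dots, 0, 1, 1, 0, \dots, 0)^T, \qquad y^{(j)} = (0, \dots, 0, a_{2j}, c_{2j-1}, 0, \dots, 0)^T. \qquad (6)$$

In matrix notation, the partitioning of A given in equation (3) is denoted as

$$A = J + \sum_{j=1}^{m-1} x^{(j)} y^{(j)T}, \qquad (7)$$

where J is the $2 \times 2$ block diagonal matrix on the left in equation (3).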

The basic idea underlying the choice of this particular partitioning is given
by the Sherman-Morrison method. Sherman and Morrison proved that, given two
$N \times N$ matrices A and J such that $A = J + x y^T$, the inverse of matrix A can
be obtained by the formula

$$A^{-1} = J^{-1} - \alpha (J^{-1} x)(y^T J^{-1}), \qquad \alpha = \frac{1}{1 + y^T J^{-1} x}. \qquad (8)$$

To compute directly the inverse of matrix A would cost $O(N^3)$ arithmetic
operations, while the use of formula (8) only implies $O(N^2)$ operations. When applied
to solve a linear system of equations $Au = d$, the solution will be

$$u = A^{-1} d = (I - \alpha J^{-1} x y^T) J^{-1} d. \qquad (9)$$

This process avoids the explicit computation of the inverse matrix.
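
Formula (9) can be checked on a small example. The Python sketch below assumes, purely for illustration, that J is a diagonal matrix (so that applying $J^{-1}$ is a component-wise division); the function name sherman_morrison_solve and the 3 x 3 data are hypothetical. Only solves with J and inner products are needed, which is the point made in the text.

```python
def sherman_morrison_solve(solve_J, x, y, d):
    """Solve (J + x y^T) u = d using only solves with J, following formula (9)."""
    Jd = solve_J(d)                                   # J^{-1} d
    Jx = solve_J(x)                                   # J^{-1} x
    alpha = 1.0 / (1.0 + sum(yi * gi for yi, gi in zip(y, Jx)))
    s = alpha * sum(yi * vi for yi, vi in zip(y, Jd))  # alpha * y^T J^{-1} d
    return [vi - s * gi for vi, gi in zip(Jd, Jx)]


# Illustrative data: J = diag(2, 3, 4), x = (1, 1, 0)^T, y = (0.5, 1, 0)^T.
# Then A = J + x y^T = [[2.5, 1, 0], [0.5, 4, 0], [0, 0, 4]] and A (1, 2, 3)^T = (4.5, 8.5, 12)^T.
diag_J = [2.0, 3.0, 4.0]
solve_J = lambda r: [ri / di for ri, di in zip(r, diag_J)]
u = sherman_morrison_solve(solve_J, [1.0, 1.0, 0.0], [0.5, 1.0, 0.0], [4.5, 8.5, 12.0])
```

The result u agrees with (1, 2, 3) up to rounding, although the inverse of A is never formed.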

The Recursive Decoupling method, described in [16], derives the solution of
system (1) by considering that $A = J + \sum_{j=1}^{m-2} x^{(j)} y^{(j)T} + x^{(m-1)} y^{(m-1)T}$, then
applying the Sherman-Morrison formula (8) to matrices A and $J + \sum_{j=1}^{m-2} x^{(j)} y^{(j)T}$.
The recursive procedure is as follows

$$M_h = \left(M_{h-1}^{-1} + x^{(h)} y^{(h)T}\right)^{-1} = \left(I - \alpha_h M_{h-1} x^{(h)} y^{(h)T}\right) M_{h-1}, \qquad \alpha_h = \frac{1}{1 + y^{(h)T} M_{h-1} x^{(h)}}. \qquad (10)$$

Index h goes from 1 to $m - 1$, $M_0$ being the matrix $J^{-1}$, and the last matrix
$M_{m-1}$ will be $A^{-1}$. Let us denote $g^{(h)} = M_{h-1} x^{(h)}$. Observe that these vectors
are needed to obtain the recursive formula (10) and can be computed using a
similar recursive method

$$g^{(h)} = \left(I - \alpha_{h-1} g^{(h-1)} y^{(h-1)T}\right) M_{h-2} x^{(h)} = \left[\prod_{i=1}^{h-1} \left(I - \alpha_i g^{(i)} y^{(i)T}\right)\right] J^{-1} x^{(h)}. \qquad (11)$$

In order to obtain the final solution $u = A^{-1} d$, from (10) follows a recursive
formula similar to (11)

$$u = A^{-1} d = \left[\prod_{j=1}^{m-1} \left(I - \alpha_j g^{(j)} y^{(j)T}\right)\right] J^{-1} d. \qquad (12)$$

Then we need to carry out the following steps,

Step 1 In this step the matrix $J^{-1}$ is calculated, as well as the product $J^{-1} d$,
the initial value of u. Given the shape of matrix J, its inverse may be
obtained by calculating the inverse of each $2 \times 2$ block $J_j$.
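Step 2 Compute the initial vectors $g^{(j)}$ for indices $j = 1, \dots, m - 1$.
Because of the pattern of the vector $x^{(j)}$, the vector $g^{(j)}$ has non-zero
elements only from positions $2j - 1$ to $2j + 2$,

$$g^{(j)} = \left(0, \dots, 0, \frac{-c_{2(j-1)}}{\Delta_{2j-1}}, \frac{e_{2(j-1)}}{\Delta_{2j-1}}, \frac{e_{2j+1}}{\Delta_{2j}}, \frac{-a_{2j+1}}{\Delta_{2j}}, 0, \dots, 0\right)^T,$$

where the $\Delta$'s denote the determinants of the two $2 \times 2$ blocks of J involved.

Step 3 In this last stage, vectors u and $g^{(j)}$ are updated with the use of equations
(11) and (12). This rank-one updating procedure, which also makes use of
the particular shape of vectors $g^{(j)}$ and $y^{(j)}$, can be described as follows
(each loop header lists the first value, the last value and the increment):

for $k = 1, 2, \dots, n - 1$

    for $j = 2^{k-1},\ 2^{n-1} - 2^{k-1},\ 2^k$

        $\alpha_j = 1 / (1 + y^{(j)T} g^{(j)})$
        $u = (I - \alpha_j g^{(j)} y^{(j)T})\, u$

        for $i = 2^k,\ 2^{n-1} - 2^k,\ 2^k$

            $g^{(i)} = (I - \alpha_j g^{(j)} y^{(j)T})\, g^{(i)}$

        end

    end

end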
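
To make the three steps concrete, the following Python function is a minimal sequential sketch of the procedure, under stated assumptions: it uses dense vectors (so it exploits none of the sparsity discussed in the next section), it reads the loop headers of step 3 as (first value, last value, increment), and all names (recursive_decoupling_solve, solve_J, y_dot) are hypothetical. It is not the authors' implementation.

```python
def recursive_decoupling_solve(a, b, c, d):
    """Solve a tridiagonal system with sub-diagonal a, diagonal b and super-diagonal c.

    a[0] and c[N-1] are ignored; N = len(b) must be a power of two.
    """
    N = len(b)
    n = N.bit_length() - 1
    assert N >= 2 and (1 << n) == N, "N must be a power of two"
    m = N // 2

    # Diagonal of J, following equations (4) and (5).
    e = [float(v) for v in b]
    for j in range(1, m):
        e[2 * j - 1] = b[2 * j - 1] - a[2 * j]
        e[2 * j] = b[2 * j] - c[2 * j - 1]

    def solve_J(r):
        """Apply J^{-1} block by block; block k couples rows 2k and 2k + 1."""
        out = [0.0] * N
        for k in range(m):
            p, q = 2 * k, 2 * k + 1
            det = e[p] * e[q] - c[p] * a[q]
            out[p] = (e[q] * r[p] - c[p] * r[q]) / det
            out[q] = (e[p] * r[q] - a[q] * r[p]) / det
        return out

    def y_dot(j, v):
        """Inner product y^(j)T v; y^(j) is non-zero only at positions 2j - 1 and 2j."""
        return a[2 * j] * v[2 * j - 1] + c[2 * j - 1] * v[2 * j]

    # Step 1: u = J^{-1} d.  Step 2: g^(j) = J^{-1} x^(j) for j = 1, ..., m - 1.
    u = solve_J(d)
    g = {}
    for j in range(1, m):
        x = [0.0] * N
        x[2 * j - 1] = x[2 * j] = 1.0
        g[j] = solve_J(x)

    # Step 3: rank-one updates in the tree order of the pseudocode.
    for k in range(1, n):
        step = 1 << k
        half = step >> 1
        for j in range(half, m - half + 1, step):
            alpha = 1.0 / (1.0 + y_dot(j, g[j]))
            s = alpha * y_dot(j, u)
            u = [ui - s * gi for ui, gi in zip(u, g[j])]
            for i in range(step, m - step + 1, step):
                t = alpha * y_dot(j, g[i])
                g[i] = [gv - t * gj for gv, gj in zip(g[i], g[j])]
    return u


# Illustrative diagonally dominant system whose exact solution is the vector of ones.
N = 8
a = [0.0] + [1.0] * (N - 1)
b = [4.0] * N
c = [1.0] * (N - 1) + [0.0]
d = [5.0] + [6.0] * (N - 2) + [5.0]
u = recursive_decoupling_solve(a, b, c, d)
```

In this sketch every update touches full-length vectors; the modifications of the following section reduce exactly this cost by restricting the updates to the non-zero parts of the vectors.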

3 The Parallel Recursive Decoupling Algorithm 

In this section we propose some modifications to the above sequential algorithm 
in order to reduce storage and execution time. Then, we propose a parallelization 
of the algorithm. 

Note that in step 2, when we calculate $g^{(j)}$ ($0 < j \le m - 1$), the initial
vectors $x^{(j)}$ only contain 2 non-zero elements. Therefore, at the 1st iteration the
vectors $g^{(j)}$ are composed of 4 non-zero elements and, in general, at iteration k,
$g^{(j)}$ is a vector with $2^{k+1}$ non-zero elements, namely the components from $2j + 1 - 2^k$
to $2j + 2^k$.

Observe in the example in Fig. 1 that to compute vector $g^{(6)}$ we do not need
all the $g^{(i)}$ vectors in each iteration k. In fact, only those vectors $g^{(i)}$ are needed
which have elements different from 0 just at row i, where column i of matrix
$(I - \alpha_j g^{(j)} y^{(j)T})$ also has elements different from 0. It can be easily proved
that this happens if $2^{k+1} \lfloor j / 2^{k+1} \rfloor \le i \le j + 2^k$. Then, the internal loop i in
step 3 of the recursive procedure can be simplified as follows

begin $= 2^{k+1} \lfloor j / 2^{k+1} \rfloor$

if (begin == 0) then begin $= 2^k$

for $i =$ begin, $j + 2^k$

    $g^{(i)} = (I - \alpha_j g^{(j)} y^{(j)T})\, g^{(i)}$

end

On the other hand, the fill-in process only occurs at several points of the
algorithm, where the values $\alpha_j$ associated with specific $g^{(j)}$ are computed. These
values are calculated using the recursive tree procedure described in step 3. For
example, in Fig. 2, at iteration 1, j takes the values {1, 3, 5, 7}, at iteration
2 the values {2, 6} and at iteration 3 the value {4}. In addition, the
vectors $g^{(1)}$, $g^{(3)}$, $g^{(5)}$ and $g^{(7)}$ are used only at the 1st iteration and during
the execution of the algorithm keep at most 4 non-zero elements. Similarly,
vectors $g^{(2)}$ and $g^{(6)}$ are used until the 2nd iteration and the number of
non-zero elements is less than 8, and so on. As a consequence, not all the vectors $g^{(j)}$
perform the fill-in procedure in the same way. We take advantage of this fact to
gather the non-zero elements, thus saving memory. Instead of arrays g[ ][ ] of
size $N/2 \times N/2$, we have arrays of size $(n - 1) \times N/2$, where $n = \log_2 N$. At
stage 2 in Fig. 1 we can see how the vectors $g^{(j)}$ are stored for the case N = 16.
The memory saving is $(2^{n-1} - n + 1)\, 2^{n-1}$ elements.
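
The saving stated above is simply the difference between the two array sizes, $(N/2)^2 - (n - 1) N/2$. As a quick check, the following Python lines evaluate both sizes; the function names are illustrative only.

```python
def dense_storage(n):
    """Elements in an (N/2) x (N/2) array, with N = 2**n."""
    m = 2 ** (n - 1)
    return m * m


def compact_storage(n):
    """Elements in an (n - 1) x (N/2) array, with N = 2**n."""
    return (n - 1) * 2 ** (n - 1)


def savings(n):
    """Closed form quoted in the text: (2**(n-1) - n + 1) * 2**(n-1)."""
    return (2 ** (n - 1) - n + 1) * 2 ** (n - 1)


# Case of Fig. 1: N = 16, i.e. n = 4.
n = 4
example = (dense_storage(n), compact_storage(n), savings(n))
```

For N = 16 the dense layout needs 64 elements and the compact one 24, a saving of 40 elements; the relative saving grows quickly with N because the compact size grows like $N \log_2 N$ rather than $N^2$.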

Concerning the parallelization of the algorithm, Fig. 1 summarizes the
different stages by means of an example (N = 16 equations on 4 PEs). In this
algorithm, the computation of the initial steps is distributed among all the
processors. Therefore, the process of partitioning matrix A, given in (7), as well as
the distribution of vectors u and d, is referred to as the preliminary stage. At this stage,
communications of the $c_{(N \cdot i)/P - 1}$ occur from processor i to processor $i + 1$ and,
for the $a_{(N \cdot i)/P}$, from processor $i + 1$ to processor i, where $i = 1, \dots, P - 1$, P
being the number of processors (see Fig. 1).

After the preliminary stage, steps 1 to 3 are performed. Bearing in mind the
block diagonal structure of matrix J, step 1 may be computed concurrently in
all the processors without any communication, since the m subsystems in (13)
can be solved in parallel. The same happens at stage 2, but in this case the
$m - 1$ subsystems in (15) are to be solved. Some vectors $g^{(j)}$ are distributed
among two processors. But this does not imply any communication since each
processor calculates the components of the vector $g^{(j)}$ using local data. As an
example, in Fig. 1, the components {2,3} of vector $g^{(2)}$ are in processor 0 and
the elements {4,5} in processor 1. This distribution of vector $g^{(j)}$ provides a
better load balance.

At stage 3, no communication is required during the first $n - p - 1$ iterations.
However, the last p iterations require communications since the i-th element of
vector u must be transferred to all the processors containing elements of the i-th
column of the matrix $(I - \alpha_j g^{(j)} y^{(j)T})$ which are different from 0. In addition,
the k-th element of vector $g^{(i)}$ must be transferred to the processors which contain
elements of column k of $(I - \alpha_j g^{(j)} y^{(j)T})$ different from 0.
Fig. 1. Scheme of the parallel algorithm for N = 16 equations and 4 processors,
starting with the preliminary stage. We denote as x the elements different from 0 both
in vectors and matrices. Circles indicate data to be transferred and arrows point out
destination processors. At stage 2, the computation of $g^{(j)}$ from $x^{(j)}$ and $J$ is
summarized. At stage 3, the figure shows how the $g^{(j)}$ for k = 1 and k = 2 are calculated.

Fig. 2. Vectors $g^{(j)}$ calculated at each iteration of step 3 for N = 16
equations.

4 Evaluation 

The recursive decoupling algorithm has been implemented on the Fujitsu AP3000
distributed memory computer [13] using the message passing programming model.
We have used the MPI programming environment. To verify the performance
of the parallel algorithm, we used a test tridiagonal system (with known
solution), whose coefficient matrix satisfies the condition $|b_i| > |a_i| + |c_i|$, $\forall i =
0, 1, \dots, N - 1$. This test is described below.



(16) 



whose exact solution is an N-dimensional vector u with components:

-I- 1 - 7 

Ui= , Vz = l,...,TV. (17) 

The experiments were performed on matrices of size ranging from 16384 ($2^{14}$)
to 1048576 ($2^{20}$) for the test (16). As we can see in Table 1, increasing the number
of processors reduces the execution time of the algorithm. We
observe that this method presents a high efficiency for all the problem sizes.

In Fig. 3.a, we show the efficiency of our modified sequential algorithm ($T_m$)
with respect to that of the initial algorithm ($T_o$). A performance increase,
$(T_o - T_m)/T_o$, of more than 90% can be observed for any value of N. On
the other hand, in Fig. 3.b we show the efficiency of the parallel algorithm for
some values of parameter N. Efficiency, $T_m/(P \cdot T_P)$, was calculated using the
execution time of the modified sequential code. The parallel algorithm exceeds
the ideal speedup due to an efficient use of local memories and the
communication optimization. Therefore, these results show that the techniques employed
to parallelize the algorithm yield good performance on distributed
memory computers. A last observation is that our parallel program is scalable.
That is, in order to maintain a constant efficiency, N must grow at the same rate as
P, as observed in Fig. 3.b. We have experimentally checked that the
results are very similar to the ones obtained for smaller matrices.

Table 1. Execution times in seconds measured on the AP3000 for different
numbers of processors P. The sizes of the matrices range from 16384 (2^14) to
1048576 (2^20).

  P      2^14     2^15     2^16     2^17     2^18     2^19      2^20
  1    0.1919   0.4231   0.9098   1.9969   4.3576   9.2092   19.3657
  2    0.0791   0.1830   0.3561   0.7990   1.7162   3.9248    7.7731
  4    0.0396   0.0848   0.1866   0.3813   0.8379   1.9117    3.9418
  8    0.0203   0.0427   0.0897   0.1897   0.3988   0.9471    1.9001
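
The efficiency figures can be recomputed from the execution times reported in Table 1. The Python sketch below assumes that the P = 1 row of Table 1 is the time $T_m$ of the modified sequential code, which the text suggests but does not state explicitly; the variable and function names are illustrative.

```python
# Execution times in seconds from Table 1, indexed by the number of processors P.
# The columns correspond to N = 2^14, 2^15, ..., 2^20.
exponents = [14, 15, 16, 17, 18, 19, 20]
times = {
    1: [0.1919, 0.4231, 0.9098, 1.9969, 4.3576, 9.2092, 19.3657],
    2: [0.0791, 0.1830, 0.3561, 0.7990, 1.7162, 3.9248, 7.7731],
    4: [0.0396, 0.0848, 0.1866, 0.3813, 0.8379, 1.9117, 3.9418],
    8: [0.0203, 0.0427, 0.0897, 0.1897, 0.3988, 0.9471, 1.9001],
}


def efficiency(P, col):
    """Efficiency T_m / (P * T_P), taking the P = 1 time as T_m (an assumption)."""
    return times[1][col] / (P * times[P][col])


for P in (2, 4, 8):
    row = ", ".join(f"2^{e}: {efficiency(P, k):.3f}" for k, e in enumerate(exponents))
    print(f"P = {P}: {row}")
```

Under this assumption every entry exceeds 1, which is the superlinear behaviour described in the text.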




Fig. 3. (a) Efficiency of the modified sequential algorithm we propose relative to
the initial algorithm. (b) Efficiency of the parallel algorithm on the AP3000,
plotted against the number of processors, for two problem sizes, one of them N = 2^15.



It is difficult to make a comparison with other implementations of the
Recursive Decoupling Method for solving tridiagonal systems on other machines, but the
speedup may be compared with those presented in [16,3]. Their numerical
results were obtained on the Balance 8000 multiprocessor system. The maximum
speedup is 2.1075 with N = 512 and P = 8. Climent et al. [3] present theoretically
predicted times for their algorithm on a Cray T3D. According to the efficiency
results we can conclude that our algorithm presents a significant improvement
in terms of performance.






Parallelization of a Recursive Decoupling Method 353 



5 Conclusions 



In this paper, we have proposed a parallelization of the recursive decoupling
method for solving tridiagonal linear systems on distributed memory computers.
The method showed an optimization of the memory requirements, a superlinear
speedup and scalability. The memory savings come from a compressed storage
policy which eliminates the null elements. On the other hand, we studied the fill-in
in the algorithm to optimize the execution of the scalar algorithm. In this way,
the performance increases by more than 91% for any value of N.



References 

1. Amor, M., Lopez, J., Argüello, F., Zapata, E. L.: Mapping Tridiagonal System
Algorithms onto Mesh Connected Computers. International Journal of High Speed
Computing 9 (1997) 101-126

2. Amodio, P., Brugnano, L.: The Parallel QR Factorization Algorithm for Tridiagonal
Linear System. Parallel Computing 21 (1995) 1097-1110

3. Climent, J.-J., Tortosa, L., Zamora, A.: A Recursive Decoupling Method for Solving
Tridiagonal Linear System in a BSP Computer. Proceedings of X Jornadas de
Paralelismo (1999) 73-78

4. Cox, C. L., Knisley, J. A.: A Tridiagonal System Solver for Distributed Memory
Parallel Processors with Vector Nodes. Journal of Parallel and Distributed Computing
13 (1991) 325-331

5. Dodson, D. S., Levin, S. A.: A Tricyclic Tridiagonal Equation Solver. SIAM J.
Matrix Anal. Appl. 13 (1992) 1246-1254

6. Egecioglu, O., Koç, Ç. K., Laub, A. J.: A Recursive Doubling Algorithm for
Solution of Tridiagonal System on Hypercube Multiprocessor. J. of Computational and
Applied Mathematics 27 (1985) 95-108

7. Golub, G. H., Van Loan, C. F.: Matrix Computations. The Johns Hopkins University
Press (1989)

8. Groen, P. P. N. de: Base-p-Cyclic Reduction for Tridiagonal System of Equations.
Applied Numerical Mathematics 8 (1991) 117-125

9. Hockney, R. W., Jesshope, C. R.: Parallel Computers. Adam Hilger (1988)

10. Krechel, A., Plum, H.-J., Stüben, K.: Parallelization and Vectorization Aspects of
the Solution of Tridiagonal Linear System. Parallel Computing 14 (1990) 31-49

11. Lin, F. C., Chung, K. L.: A Cost-Optimal Parallel Tridiagonal Solver. Parallel
Computing 15 (1990) 189-199

12. Lin, W.-Y., Chen, C.-L.: A Parallel Algorithm for Solving Tridiagonal Linear
Systems on Distributed-Memory Multiprocessors. International Journal of High Speed
Computing 6 (1994) 375-386

13. Ishihata, H., Takahashi, M., Sato, H.: Hardware of AP3000 Scalar Parallel Server.
FUJITSU Sci. Tech. J. 33 (1) (1997) 24-30

14. Mattor, N., Williams, T. J., Hewett, D. W.: Algorithm for Solving Tridiagonal
Matrix Problems in Parallel. Parallel Computing 21 (1995) 1769-1782

15. Müller, S. M., Scheerer, D.: A Method to Parallelize Tridiagonal Solvers. Parallel
Computing 17 (1991) 181-188




354 Margarita Amor et al. 



16. Spaletta, G., Evans, D. J.: The Parallel Recursive Decoupling Algorithm for Solving
Tridiagonal Linear Systems. Parallel Computing 19 (1993) 563-576

17. Wang, X., Mou, Z. G.: The Parallel Recursive Decoupling Algorithm for Solving
Tridiagonal Linear Systems. Proceedings of the Third IEEE Symposium on Parallel
and Distributed Processing (1991) 810-817




A New Parallel Approach to the Toeplitz Inverse 
Eigenproblem Using Newton-like Methods 



Jesus Peinado¹, Antonio M. Vidal²

Departamento de Sistemas Informaticos y Computacion
Universidad Politecnica de Valencia. Valencia, 46071, Spain
Phone: +(34)-6-3877798. Fax: +(34)-6-3877359
¹ jpeinado@dsic.upv.es
(Author in charge of correspondence)
² avidal@dsic.upv.es



Abstract. In this work we describe several portable sequential and parallel 
algorithms for solving the inverse eigenproblem for Real Symmetric Toeplitz 
matrices. The algorithms are based on Newton's method (and some variations), for 
solving nonlinear systems. We exploit the structure and some properties of Toeplitz 
matrices to reduce the cost, and use Finite Difference techniques to approximate the 
Jacobian matrix. With this approach, the storage cost is considerably reduced, 
compared with parallel algorithms proposed by other authors. Furthermore, all the
algorithms are efficient in terms of computational cost. We have implemented the
parallel algorithms using the parallel numerical linear algebra library SCALAPACK
based on the MPI environment. Experimental results have been obtained using two
different architectures: a shared memory multiprocessor, the SGI PowerChallenge,
and a cluster of Pentium II PCs connected through a Myrinet network. The
algorithms obtained show good scalability in most cases.



1 Introduction and Objectives 

In this work we describe several portable sequential and parallel algorithms for 
solving the inverse eigenproblem for Real Symmetric Toeplitz (RST) matrices. These 
matrices appear in several numerical problems in physics and engineering. There are
many references related to solving Toeplitz linear systems; however, references related
to the Toeplitz inverse problem are limited [6]. The use of parallel computers is of
interest because of the cost of solving this problem.

The algorithms presented in this paper are based on Newton’s method, (Newton, 
Shamanskii and Chord methods, and the Armijo Rule) [10], for solving large scale 
general nonlinear systems. We exploit the structure and some properties of the 
Toeplitz matrices to reduce the cost. We use finite difference techniques [10] to 
approximate the Jacobian Matrix. Our idea is to use as standard a method as possible. 
Our approach for solving this problem as a general nonlinear system is different from 
other “state of the art” sequential [19] and parallel [3] algorithms. Our algorithms 
considerably reduce storage cost, and thus allow us to work with larger problems. 
Furthermore, our algorithms are efficient in terms of computational cost. 
In our experiments we have considered 15 test problems detailed in [2,16]. Each
problem has a different pattern (spectrum) of eigenvalues. To compare our approach
with other nonlinear solvers we use Powell’s method, implemented in the MINPACK-1
[15] standard package. Powell’s method is a robust general purpose method to solve
nonlinear systems.

All our algorithms have been implemented using portable standard packages. In the 
serial algorithms we used the LAPACK [1] library, while the parallel codes make use 
of the SCALAPACK [4] and BLACS [20] libraries on top of the MPI [18] 
communication library. All the codes have been implemented using C++. 

Experimental results have been obtained using two different architectures: a shared 
memory multiprocessor, the SGI PowerChallenge, and a cluster of Pentium II PC’s 
connected through a Myrinet [14] network. However other machines could be used 
due to the portability of the packages and our code. 

On both machines we achieved good results and showed that our algorithms are scalable.
We want to emphasize the behaviour of the algorithms on the cluster of PCs with
the Myrinet network. This system is a good, cheap alternative to more expensive
systems because its performance-to-price ratio is higher than that of classical
MPP machines.



2 The Problem 

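Let $t = [t_0, t_1, \dots, t_{n-1}]$, where $t_0, t_1, \dots, t_{n-1}$ are real numbers, and let $T(t)$ be the real
symmetric Toeplitz (RST) matrix:

$$T(t) = \left(t_{|i-j|}\right)_{i,j=1}^{n}.$$

We say that t generates $T(t)$, and denote the eigenvalues of $T(t)$ by:

$$\lambda_1(t) \le \lambda_2(t) \le \dots \le \lambda_n(t).$$

The inverse problem [19] for RST matrices can be described as follows:

Given n real numbers $\lambda_1 \le \lambda_2 \le \dots \le \lambda_n$, find an n-vector t such that:

$$\lambda_i(t) = \lambda_i, \qquad 1 \le i \le n.$$

We will call $\lambda_1 \le \lambda_2 \le \dots \le \lambda_n$ the target eigenvalues, and $\Lambda = [\lambda_1, \lambda_2, \dots, \lambda_n]$ the target
spectrum.

We will use nonlinear system techniques to find $t = [t_0, t_1, \dots, t_{n-1}]$, where $t_0, t_1, \dots, t_{n-1}$
are the RST matrix coefficients. Thus, $t^{(0)}$ is the starting point and we construct a
sequence using an iterative process which converges towards
$t^{(*)} = [t^{(*)}_0, t^{(*)}_1, \dots, t^{(*)}_{n-1}]$, the generator of an RST matrix whose eigenvalues are the target spectrum.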
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2.1 Newton’s Method 

We used Newton’s method (and some variations) to solve our nonlinear system, 
because this method is powerful and it has quadratic local convergence [10]. Newton’s 
method [10] is based on the following algorithm, where J is the Jacobian Matrix and F 
is the function whose root is to be found {k is the iteration number): 

J{x')p = -F(x‘), J{x) G iR™, 

p‘, F(x‘)e9t" . 

We also used several variations to Newton’s method: if we only compute the Jacobian 
matrix and factorize it in the first iteration, we use the Chord method. If we compute 
the Jacobian matrix and factorize it only at given iterations, this is the Shamanskii 
method. These changes reduce the time cost of Newton’s method, because far fewer 
Jacobian evaluations and factorizations are performed, however convergence is q- 
linear [10]. This can be shown better if we write the transition from x‘to x‘*‘: 

=x*-J(x*) 'f(x*) 

T /+1 = T; - )“' F{x ) for 1 < 7 < w - 1, 

k+\ 

^ ^ym- 

Note that m = \ is, Newton’s method and m = °° is the Chord method. Other values 
of m define the Shamanskii method. These methods are frequently used for very large 
problems. 

In order to improve convergence, we perform a line search using Armijo’s rule.
This makes Newton’s method globally convergent (otherwise it is only locally
convergent) [10]. This improvement is necessary to reach convergence in some cases.
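
The three variants differ only in how often the Jacobian is recomputed and refactorized,
which the following sketch tries to make explicit. It is a minimal illustration in plain
Python, not the LAPACK-based code used in this work: the names lu_factor, lu_solve and
newton_family, the two-equation demo system and the stopping rule on the norm of F are
all assumptions made for the example, and the Armijo line search is left out.

```python
import math

def lu_factor(A):
    """In-place style LU factorization with partial pivoting; returns (LU, perm)."""
    n = len(A)
    LU = [row[:] for row in A]
    perm = list(range(n))
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(LU[r][k]))
        if LU[p][k] == 0.0:
            raise ZeroDivisionError("singular Jacobian")
        LU[k], LU[p] = LU[p], LU[k]
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            LU[i][k] /= LU[k][k]
            for j in range(k + 1, n):
                LU[i][j] -= LU[i][k] * LU[k][j]
    return LU, perm

def lu_solve(LU, perm, b):
    """Solve A x = b from the factors: forward, then backward substitution."""
    n = len(b)
    y = [b[perm[i]] for i in range(n)]
    for i in range(n):                      # unit lower triangular system
        for j in range(i):
            y[i] -= LU[i][j] * y[j]
    for i in reversed(range(n)):            # upper triangular system
        for j in range(i + 1, n):
            y[i] -= LU[i][j] * y[j]
        y[i] /= LU[i][i]
    return y

def newton_family(F, J, x0, m=1, tol=1e-12, max_iter=200):
    """m=1: Newton, m=None: Chord (m = infinity), any other m: Shamanskii.

    Every pass of the loop is one inner step y_{j+1} = y_j - J^{-1} F(y_j);
    the Jacobian is refactorized only every m-th step.
    """
    x = list(x0)
    for it in range(max_iter):
        fx = F(x)
        if math.sqrt(sum(v * v for v in fx)) < tol:   # assumed stopping rule
            return x, it
        if it == 0 or (m is not None and it % m == 0):
            LU, perm = lu_factor(J(x))
        p = lu_solve(LU, perm, [-v for v in fx])
        x = [xi + pi for xi, pi in zip(x, p)]
    raise RuntimeError("no convergence within max_iter steps")

def demo_F(v):
    """Hypothetical test system: circle of radius 2 intersected with the line x = y."""
    return [v[0] ** 2 + v[1] ** 2 - 4.0, v[0] - v[1]]

def demo_J(v):
    return [[2.0 * v[0], 2.0 * v[1]], [1.0, -1.0]]

if __name__ == "__main__":
    for m in (1, 3, None):
        x, steps = newton_family(demo_F, demo_J, [2.0, 1.0], m=m)
        print("m =", m, "steps =", steps, "x =", x)
```

On this small system all three settings reach the same root; the Chord setting needs
more steps but factorizes the Jacobian only once, which is the trade-off described above.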



2.2 Adapting Newton’s Method to the Inverse Toeplitz Eigenproblem 

We now describe Newton’s method as applied to the inverse Toeplitz eigenproblem.

The starting point ($x^0$): we used two different starting points. Both of them are
empirical and depend on the problem to be solved (the target spectrum).



Normalized [3] Laurie [12]:

$$t_i = \begin{cases} \dfrac{1}{\sqrt{2(n-1)}} & \text{if } i = 1, \\[4pt]
0 & \text{if } i \ne 1, \end{cases}$$

or Trench [19]:

1 

if i is odd 

t - \ M f with M = 

1 fj n 

0 if i is even 
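
The function $F(x^k)$: this is the value of the function at the $k$-th iteration.

Let $t$ be the vector that generates $T(t)$.

Let $\Lambda = [\hat\lambda_1, \hat\lambda_2, \ldots, \hat\lambda_n]^T$ be the target
spectrum; then $F(x)$, $x \in \mathbb{R}^n$, is defined as

$$F(x) = \mathrm{eig}(T(x)) - \Lambda,$$

where

$$\mathrm{eig}(T(x)) = [\lambda_1(x), \lambda_2(x), \ldots, \lambda_n(x)]^T.$$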

The computational cost of computing the eigenvalues of a large matrix is high. We can 
exploit here some of the properties of the Toeplitz matrices. If we use Cantoni and 
Butler’s theorems [5], we can obtain the eigenvalues of the matrix from the 
eigenvalues of two matrices half its size. 

The Jacobian matrix $J(x^k)$: this is the value of the Jacobian matrix at the $k$-th
iteration. We compute it with the forward difference approximation technique:

$$\frac{\partial F(x)}{\partial x_j} \approx \frac{F(x + h e_j) - F(x)}{h}.$$

This technique greatly increases the time cost, because $F(x)$ has to be computed once
per iteration, and $F(x + h e_j)$, $j = 1, 2, \ldots, n$, must be computed once per
column $j$ of the Jacobian matrix. This is the most time-consuming step in our
algorithms. However, this cost can be alleviated slightly, as the entries of the first
column do not have to be computed because they are all equal to 1:

$$\frac{\mathrm{eig}(T(t^{(k)} + h e_1)) - \mathrm{eig}(T(t^{(k)}))}{h}
= \frac{\mathrm{eig}(T(t^{(k)})) + h - \mathrm{eig}(T(t^{(k)}))}{h} = \frac{h}{h}.$$

Alternative techniques to construct the Jacobian matrix can be found in [12] and [19], 
but those techniques imply the construction and storage of the eigenvectors of an RST 
matrix. The use of the difference approximation alleviates the computational and 
storage cost of this step. 
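
For n = 3 both ideas can be checked by hand, which the sketch below does in plain
Python. Splitting the eigenvectors into antisymmetric and symmetric ones, in the spirit
of the Cantoni and Butler result cited above, gives one 1-by-1 and one 2-by-2 problem
with closed-form eigenvalues. The function names rst_eigs_3 and fd_jacobian and the step
h = 1e-7 are assumptions made for this illustration; they are not taken from the
implementation described here.

```python
import math

def rst_eigs_3(t):
    """Sorted eigenvalues of the 3x3 RST matrix generated by t = [t0, t1, t2].

    Antisymmetric eigenvector (1, 0, -1) gives t0 - t2; the symmetric vectors
    (x, y, x) reduce to the 2x2 matrix [[t0 + t2, t1], [2*t1, t0]].
    """
    t0, t1, t2 = t
    anti = t0 - t2
    root = math.sqrt(t2 * t2 + 8.0 * t1 * t1)
    return sorted([anti, (2.0 * t0 + t2 - root) / 2.0, (2.0 * t0 + t2 + root) / 2.0])

def fd_jacobian(F, x, h=1e-7):
    """Forward-difference Jacobian: column j is (F(x + h e_j) - F(x)) / h."""
    fx = F(x)                       # computed once per iteration
    n = len(x)
    jac = [[0.0] * n for _ in range(n)]
    for j in range(n):              # one extra evaluation of F per column
        xh = list(x)
        xh[j] += h
        fxh = F(xh)
        for i in range(n):
            jac[i][j] = (fxh[i] - fx[i]) / h
    return jac

if __name__ == "__main__":
    # The target spectrum cancels in the difference, so eig(T(x)) can be used directly.
    for row in fd_jacobian(rst_eigs_3, [2.0, 1.0, 0.5]):
        print(row)
```

The first column of the printed matrix is numerically equal to 1 in every row, because
adding h to the first coefficient adds h to the diagonal of T and therefore shifts every
eigenvalue by h.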

Storage cost is determined by the storage needed to compute the eigenvalues and to
solve a linear system. The cost is the same for the three methods: 1) storing the
matrices to compute $F$, and 2) storing the Jacobian matrix and the linear system. The
cost of 1 consists of storing two half-sized problem matrices, and the cost of 2 is
storing one problem-sized matrix.

The linear system: the Jacobian matrix has no special structure or property. We used
an LU factorization followed by forward and backward substitution to solve the two
resulting triangular systems [9].



2.3 The Sequential Algorithm 

The sequential algorithm uses all the techniques described in the previous Section. In 
order to design a code as efficient as possible, we make use of the LAPACK library. 
The Newton algorithm for the inverse Toeplitz eigenproblem is the following: 



The Newton sequential Algorithm (for the Inv. Toeplitz eigenproblem)

Choose a starting point x^0
Compute vector F(x^0):
    v <- eig(T(x^0))                          (* F(x^0) + Λ *)
While the stopping criterion has not been reached
    Compute Jacobian matrix J(x^k):
        For j = 1 : n
            If column j = 1
                J(x^k)_1 = [1, 1, ..., 1]^T
            else
                w <- eig(T(x^k + h e_j))      (* F(x^k + h e_j) + Λ *)
                J(x^k)_j = (w - v) / h
    Solve the linear system J(x^k) s^k = -F(x^k):
        Factorize J(x^k) = LU
        Solve LU s^k = -F(x^k)
    Update the iterate x^{k+1} = x^k + s^k
    Compute vector F(x^{k+1}):
        v <- eig(T(x^{k+1}))                  (* F(x^{k+1}) + Λ *)

The Chord and Shamanskii methods have similar algorithms.

Note: when computing the Jacobian, we add and subtract the target spectrum,

$$\frac{\mathrm{eig}(T(t^{(k)} + h e_j)) - \Lambda - \mathrm{eig}(T(t^{(k)})) + \Lambda}{h}
= \frac{\mathrm{eig}(T(t^{(k)} + h e_j)) - \mathrm{eig}(T(t^{(k)}))}{h},$$

so it is not necessary to subtract $\Lambda$ when storing $F(x)$ and $F(x + h e_j)$,
$j = 1, 2, \ldots, n$, in the variables. For example:

v <- F(x^k) = eig(T(x^k)), but this expression is really F(x^k) + Λ.
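
The sequential algorithm can be tried end to end on the smallest non-trivial case,
n = 3, where the eigenvalues have a closed form. The following plain Python sketch
follows the steps listed above (first Jacobian column set to ones, remaining columns by
forward differences on v = F + Λ, one linear solve per iteration). It is only an
illustration under stated assumptions: the names rst_eigs_3, solve and
inverse_rst_newton, the stopping rule on the largest residual, the step h and the
starting point are all chosen for the example, and a small Gaussian elimination stands
in for the LAPACK LU solver.

```python
import math

def rst_eigs_3(t):
    """Sorted eigenvalues of the 3x3 RST matrix generated by t = [t0, t1, t2]."""
    t0, t1, t2 = t
    root = math.sqrt(t2 * t2 + 8.0 * t1 * t1)
    return sorted([t0 - t2, (2.0 * t0 + t2 - root) / 2.0, (2.0 * t0 + t2 + root) / 2.0])

def solve(A, b):
    """Gaussian elimination with partial pivoting on the augmented matrix."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def inverse_rst_newton(target, x0, h=1e-7, tol=1e-10, max_iter=50):
    """Newton iteration for eig(T(x)) = target, restricted here to n = 3."""
    x = list(x0)
    n = len(x)
    for _ in range(max_iter):
        v = rst_eigs_3(x)                               # v holds F(x) + Lambda
        residual = [vi - li for vi, li in zip(v, target)]
        if max(abs(r) for r in residual) < tol:         # assumed stopping criterion
            return x
        jac = [[1.0] * n for _ in range(n)]             # first column: all ones
        for j in range(1, n):                           # remaining columns
            xh = list(x)
            xh[j] += h
            w = rst_eigs_3(xh)                          # w holds F(x + h e_j) + Lambda
            for i in range(n):
                jac[i][j] = (w[i] - v[i]) / h
        s = solve(jac, [-r for r in residual])
        x = [xi + si for xi, si in zip(x, s)]
    raise RuntimeError("no convergence within max_iter iterations")

if __name__ == "__main__":
    target = rst_eigs_3([2.0, 1.0, 0.5])                # spectrum of a known RST matrix
    t = inverse_rst_newton(target, [2.2, 0.8, 0.3])
    print("generator:", t)
    print("spectrum :", rst_eigs_3(t), "target:", target)
```

The solution of the inverse problem is not unique (changing the sign of the second
coefficient leaves the spectrum unchanged in this 3-by-3 case), so the check is made on
the recovered spectrum rather than on the generator itself.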

The computational cost of the sequential algorithm will depend on the iteration cost. 
The algorithm uses routines to: compute the eigenvalues, add two vectors, solve a 
linear system, compute vector norms and merge two vectors. The final expression is 
as follows: 



Newton 


Chord 


Shamanskii 


1 

+ 

1 


1 

+ 

1 


K 


/ 4 3 V 

n n 

V m — 


3 3 ) 


' 3 3 




1 3 3 ) 



where $k_N$, $k_S$ and $k_C$ are the respective numbers of iterations for the Newton,
Shamanskii and Chord methods. A more detailed analysis can be found in [16].



The storage cost is the same for the three methods: 1) Storing the matrices to compute 
F, and 2) Jacobian matrix and the linear system. The cost of 1 consists of storing two 
half sized problem matrices, and the cost of 2 is storing a matrix of size equal to the 
problem: 




3 Parallel Algorithm 



3.1 How to Parallelize the Sequential Algorithm 

The parallel version uses the SCALAPACK library. Within an iteration, the 
computation of the Jacobian, the solution of the linear system, and the update of the 
iterate are parallelized. 

The SCALAPACK library uses a 2-D block cyclic data distribution. Fig. 1 shows an
example of such a distribution: a 9-by-9 matrix distributed over a 2 × 3 grid of
processors before performing its LU factorization.

Fig. 1. SCALAPACK block cyclic distribution for the LU algorithm and the right hand side.



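A standard SCALAPACK distribution could be as follows: a matrix of $M$ rows × $N$
columns, partitioned into blocks of size $MB \times NB$, is distributed on a 2-D
processor mesh with $P$ processors. The mesh size is $P_r$ row processors by $P_c$
column processors. The $(i,j)$ entry is located on the processor $(p_r, p_c)$ as follows:

$$(p_r, p_c) = \big( ((i-1) \ \mathrm{div}\ MB) \bmod P_r,\ \ ((j-1) \ \mathrm{div}\ NB) \bmod P_c \big),$$

$$0 \le p_r \le P_r - 1, \quad 0 \le p_c \le P_c - 1, \quad 1 \le i \le M, \quad 1 \le j \le N.$$

Analogously, on $(p_r, p_c)$ the entries $(i,j)$ are located at:

$$(i, j) = (x \cdot MB \cdot P_r + p_r \cdot MB + k,\ \ y \cdot NB \cdot P_c + p_c \cdot NB + l),$$

$$x = 0, \ldots, \left\lceil \frac{M}{MB \cdot P_r} \right\rceil - 1, \quad
y = 0, \ldots, \left\lceil \frac{N}{NB \cdot P_c} \right\rceil - 1, \quad
k = 1, \ldots, MB, \quad l = 1, \ldots, NB.$$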

For the right hand side vector we must reference the indexes showing the row entries. 
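
The two index mappings can be checked against each other with a few lines of plain
Python. The sketch below is illustrative only: the function names owner and
local_indices are hypothetical, and the block size MB = NB = 2 used in the usage lines
is an assumption inferred from the 9-by-9 example on a 2 × 3 grid, not a value stated
in the text.

```python
import math

def owner(i, j, MB, NB, Pr, Pc):
    """Process coordinates (p_r, p_c) that hold the global entry (i, j), 1-based."""
    return ((i - 1) // MB) % Pr, ((j - 1) // NB) % Pc

def local_indices(p, size, blk, nprocs):
    """Global 1-based indices owned along one dimension by process coordinate p.

    size is M (or N), blk is MB (or NB), nprocs is P_r (or P_c).
    """
    owned = []
    for x in range(math.ceil(size / (blk * nprocs))):
        for k in range(1, blk + 1):
            g = x * blk * nprocs + p * blk + k
            if g <= size:              # the last block may be incomplete
                owned.append(g)
    return owned

if __name__ == "__main__":
    M = N = 9
    MB = NB = 2                        # assumed block size
    Pr, Pc = 2, 3
    for pc in range(Pc):
        print("process column", pc, "owns matrix columns", local_indices(pc, N, NB, Pc))
    for pr in range(Pr):
        print("process row", pr, "owns matrix rows", local_indices(pr, M, MB, Pr))
```

With these assumed block sizes, process column 0 owns matrix columns 1, 2, 7, 8 and
process column 1 owns columns 3, 4, 9. For the right-hand side vector only the row
mapping is needed.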

Our algorithms are designed for an efficient computation of the Jacobian matrix $J$,
because it is the most time-consuming step. First, we need to apply the forward
difference approximation formula to compute the Jacobian matrix. To do this, $F(x)$
must be replicated on each processor. Therefore $F$ is always computed sequentially on
each processor. As $F(x)$ is available, we only need to compute $F(x + h e_j)$ with
$j = 1, 2, \ldots, n$. Our suggestion for performing this computation efficiently is the
following: each column of processors in the logical mesh provided by the SCALAPACK
package is in charge of computing a set of columns of the Jacobian matrix. For example,
Fig. 2 shows that processors $(p_r, p_0)$ must compute columns 1, 2, 7, 8, processors
$(p_r, p_1)$ must compute columns 3, 4, 9, and so on.



Fig. 2. Computing the Jacobian in parallel.



In addition, the work corresponding to a column of processors is divided among the
processors in that column. Thus, in our example, processor $(p_0, p_0)$ computes
columns 1, 2 and processor $(p_1, p_0)$ computes columns 7, 8. The same idea is applied
to all the processors. Finally, a local communication between the processors in the same
column must be carried out in order to achieve the proper distribution of the Jacobian
matrix.

Once the Jacobian matrix has been computed, we need to solve the linear system and
update the iterate. For solving the linear system, we used the SCALAPACK routine PDGESV
(and some variations). The distribution of the elements is shown in Fig. 1. When the
system is solved, the solution $s$ is stored in the right-hand side vector. Then we only
have to broadcast $s$ to all the processors and update the iterate.

For broadcasting $s$, we used two calls to the BLACS:

1. All the first column processors must have the complete $s^{(k)}$.

2. Each first column processor sends the complete vector to the processors located in
its same row.

With these steps the vector is located in all the processors, and we only have to update
the iterate.

The algorithm for each processor $(p_r, p_c)$ of the grid is the following:

The Newton Parallel Algorithm (for the Inv. Toeplitz eigenproblem)

Choose x^0                                    (* same on all the processors *)
Repeat
    Compute F using:
        v <- eig(T(x^k))                      (* F(x^k) + Λ *)
    Compute Jacobian matrix J(x^k):
        For the (n/P_c) columns of (p_r, p_c)
            If column j = 1 ∈ (p_r, p_c)
                J(x^k)_1 = [1, 1, ..., 1]^T
            else
                w <- eig(T(x^k + h e_j))      (* F(x^k + h e_j) + Λ *)
                J(x^k)_j = (w - v) / h
        Exchange the (n/P_r) rows with the rows belonging
        to the processors (i, p_c), i = 0 ... P_r - 1, i ≠ p_r
    Solve the linear system J(x^k) s^k = -F(x^k):
        using SCALAPACK's pdgesv(J(x^k), -F(x^k), s^k, p_r, p_c)
    Update the iterate:
        If (p_c = 0)                          (* column 0 processors *)
            Update the subvector x adding the subvector s
            Broadcast the subvector x
        else
            Receive the updated subvector x
until the stopping criterion has been reached


The complete computing cost ($T_P$) for all the algorithms can be obtained by adding
the arithmetic cost ($T_A$) and the communication cost ($T_C$). We call $t_f$ the time
for performing one floating point operation, $\beta$ the time to prepare a message
(latency), and $\tau$ the cost of sending one data item [4][8]. This gives us the
following expression for the time to send a message composed of $n$ data items:

$$\beta + n\tau.$$

For our algorithms the complete computing costs are:

Newton 



T =k, 

p A 



/ 4 3 3 \ 

n n n 

1 1 

\ 3P 3F 3 J 



+ {n^ + «Vp)t + (« logj P + n-Jp)!i 



Chord 



T = 



/ 4 , 

' n k n 



^ + {n + 2kn~J~P^T + {n log^ P + kn^Tp^fi . 



Shamanskii

T =k 



f 4 

' n 



3 3 

n mn 

+ — + 

l\3P 3P 3 ) 



+ {n^ + 2mn-\fp^T + (« log^ P + mn 



A more detailed cost analysis can be found in [16]. 

The storage cost is the same for the three methods: 1) storing the matrices to compute
$F$, and 2) storing the Jacobian matrix and the linear system. The cost of 1 consists of
storing two half-sized problem matrices replicated in each processor, and the cost of 2
is storing one problem-sized matrix:

$$\text{Total cost} = 2\left(\frac{n}{2}\right)^2 P + n^2 = \frac{(P+2)\,n^2}{2};
\qquad \text{Cost per processor} = \frac{n^2}{2} + \frac{n^2}{P}.$$

This cost improves on that of the algorithms in [12] and [19] because we do not need to
compute and store the eigenvectors.



4 Experimental Results 



4.1 Test Problems

We present here a brief study of the performance of the parallel algorithm. We used a
group of 15 problems [16]. Each problem consists of a different kind of spectrum. The
first three types of spectrum are generated randomly, following some statistical
distributions. The other 12 types correspond to the eigenvalues of tridiagonal matrices
used as test matrices in several papers [13] [3]. Among the latter 12 spectra we can
distinguish between the first 7, where the elements and the spectra are generated using
well-defined formulas, and the last 5, where the matrices and the spectra are generated
using LAPACK’s dlatms [7] routine.

We chose here type 4 of the 15 test problems and applied Newton’s parallel method. We
used 6 different problem sizes: N = 200, 256, 400, 800, 1200, and 1600.




Fig. 3. Speedup figures corresponding to the SGI (left) and the cluster of PC’s (right). 

We experimented on two different machines: an SGI multiprocessor with 10 MIPS
R10000/195 MHz processors, and a cluster of 20 Pentium II/300 MHz PCs with a Myrinet
network. Fig. 3 shows the speedup of our algorithms. Speedup was obtained with respect
to the Chord algorithm because it performs better than the standard Powell’s method
algorithm used in the MINPACK-1 package [15]. The good performance obtained on both
machines can be clearly seen. In both figures we are near the theoretical maximum
speedup. The performance is good even for small problem sizes.

A scalability study was also carried out. Fig. 4 shows the scaled speedup [11]. N is the
initial size for each case and is increased by a factor k(p) when increasing the number
of processors p (see [11]). For example, for the N = 200 case, the successive sizes will
be 238, 282, 313.




Fig. 4. Scaled speedup corresponding to the cluster of PC’s. 

Fig. 4 shows that our algorithms are scalable. Performance is especially good for
problems where the initial sizes for the scalability test are small or medium. For large
initial sizes, scalability decreases slightly.



4.2 Performance Evaluation Compared with the Theoretical Model 

In this section we carry out a theoretical performance analysis of the parallel
algorithms. We used the theoretical costs given in the previous sections and a machine
analysis to parameterize our machine. The analysis consists of obtaining the parameters
that characterize the machine ($t_f$, $\tau$, $\beta$); with these parameters, the main
goal is to obtain the theoretical behaviour of our algorithms. In our case we performed
the analysis for all our algorithms on a network of computers: standard PCs with a
Myrinet network.

To obtain $t_f$ (the flop time) we could use a standard routine from any of the
sequential libraries, but this time varies too much between routines; that is to say,
for different algorithms we would obtain different flop times. A more accurate
possibility is to obtain $t_f$ using our sequential algorithm. With this analysis the
flop time is $t_f = 0.018$ microseconds.

To obtain the communication time, we used the double Ping-Pong algorithm, where one
processor sends several messages of different sizes to another, which then returns them.
The time attributed to each message is half the time required to send it and get it
back. Sending the minimum-sized messages we obtain the value of $\beta$, while sending
the maximum-sized messages we obtain the value of $\tau$. The value obtained for the
latency $\beta$ is 33 microseconds, and the value for $\tau$ is 0.03 microseconds.
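
As a small worked example, the measured parameters can be plugged into the message cost
expression, latency plus number of items times the per-item cost. The plain Python
sketch below does only this arithmetic; the constant and function names are
hypothetical, and the message length of 1000 items is an arbitrary choice for
illustration.

```python
# Machine parameters reported above, in microseconds.
T_FLOP_US = 0.018    # time per floating point operation
BETA_US = 33.0       # latency: time to prepare a message
TAU_US = 0.03        # cost of sending one data item

def message_time_us(n_items, beta=BETA_US, tau=TAU_US):
    """Time to send one message of n_items data items: beta + n * tau."""
    return beta + n_items * tau

def breakeven_items(beta=BETA_US, tau=TAU_US):
    """Message length at which the transfer term n * tau equals the latency."""
    return beta / tau

def flops_per_latency(beta=BETA_US, t_flop=T_FLOP_US):
    """Number of floating point operations that fit into one message latency."""
    return beta / t_flop

if __name__ == "__main__":
    print("1000 items     :", message_time_us(1000), "microseconds")
    print("break-even size:", breakeven_items(), "items")
    print("flops / latency:", flops_per_latency())
```

With these values a message of 1000 items costs about 33 + 1000 × 0.03 = 63
microseconds, and the latency alone is worth roughly 1800 floating point operations,
which is why short messages are comparatively expensive in this model.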

We included this study, firstly, to test whether the theoretical model developed above
is good and, secondly, to be able to predict the behaviour of the algorithm on a
computer.

We can compare the figures corresponding to the theoretical and experimental speedups.
The two figures are very similar. The only small difference corresponds to small
matrices (sizes 200 and 256). In addition, the theoretical speedup is slightly better
than the experimental one. This is normal.

These results give us confidence in our theoretical model. In principle, such a model
could be used to predict the behaviour of the algorithm on the computer when changing
the size of the problem and/or the number of processors.

Fig. 5 and 6. Comparing the theoretical speedup model (left) and the experimental speedup
(right).



5 Conclusions 

We have developed a new approach for solving the inverse eigenproblem for RST
matrices. Our method has several advantages with respect to “state of the art”
algorithms. We solve the problem as a general nonlinear system, using the difference
approximation technique to approximate the Jacobian. This gives a more general
perspective on this problem. We have also managed to reduce the storage cost, which
allows us to work with larger problems. Furthermore, our parallel algorithm is
efficient when working with small and medium-sized problems.

With respect to the theoretical model, we think the model is quite close to the
experimental results. This is very important because, with such a model, we can in
principle predict the behaviour of any algorithm on the parameterized machine (in our
case the Myrinet network). Finally, we note the behaviour of the Myrinet network. We
think it could be a good, cheap alternative to classical MPP machines.

Abstract. An efficient parallel algorithm, which we dubbed farmzeroinNR, for the
eigenvalue problem of a symmetric tridiagonal matrix has been implemented in a
distributed memory multiprocessor with 112 nodes [1]. The basis of our parallel
implementation is an improved version of the zeroinNR method [2]. It is consistently
faster than simple bisection and produces more accurate eigenvalues than the QR method.
As happens with bisection, zeroinNR exhibits great flexibility and allows the
computation of a subset of the spectrum with some prescribed accuracy. Tests were
carried out with matrices of different types and sizes up to $10^4$ and show that our
algorithm is efficient and scalable.



1 Introduction 

The computation of the eigenvalues of symmetric tridiagonal matrices is one of 
the most important problems in numerical linear algebra. The reason for this 
is the fact that in many cases the initial matrix, if not already in tridiagonal 
form, is reduced to this form using either orthogonal similarity transformations, 
in the case of dense matrices, or the Lanczos method, in the case of large sparse 
matrices. 

Essentially we can consider three different kinds of methods for this problem:
the QR method and its variations [3], [4], the divide-and-conquer methods^1 [5], [6],
and the bisection-multisection methods [7], [8], [3], [9]. For many years, the
QR method was the algorithm of choice for computing the complete spectrum
of tridiagonal matrices. A new algorithm, called dqds, is now used in LAPACK
(routine spteqr) to compute all eigenvalues and, optionally, eigenvectors of a
symmetric positive definite tridiagonal matrix by first factoring the matrix into
the product of a bidiagonal matrix B and its transpose, and then using the routine
sbdsqr to compute the singular values of the bidiagonal factor. The algorithm,
proposed in [10], computes singular values of bidiagonal matrices with high relative
accuracy. An implementation of the dqds algorithm to compute the eigenvalues of
symmetric tridiagonal positive matrices has been given in [11].

^1 Available as LAPACK routine sstevd; a good choice if we desire all eigenvalues and
eigenvectors of a tridiagonal matrix whose dimension is larger than about 25 [4, pg.
217].

J.M.L.M. Palma et al. (Eds.): VECPAR2000, LNCS 1981, pp. 369-379, 2001.

(c) Springer-Verlag Berlin Heidelberg 2001

The bisection method is a robust method [12] but is slower than the other 
methods for the computation of the complete set of eigenvalues. However, be- 
cause of the excellent opportunities it offers for parallel processing, several paral- 
lel algorithms have been proposed which use bisection to isolate each eigenvalue 
and then some additional technique with better convergence rate to compute 
the eigenvalue to the prescribed accuracy [13], [14], [15], [16], [17]. One such
method, dubbed zeroinNR, has been proposed in [2] and uses an original
implementation of the Newton-Raphson method for this purpose.

2 A Sequential Algorithm: zeroinNR 

Let $A$ be a real, symmetric tridiagonal matrix, with diagonal elements
$a_1, \ldots, a_n$ and off-diagonal elements $b_1, \ldots, b_{n-1}$. The sequence of the
leading principal minors of $A - \lambda I$ is given by

$$\begin{cases}
p_0(\lambda) = 1 \\
p_1(\lambda) = a_1 - \lambda \\
p_i(\lambda) = (a_i - \lambda)\,p_{i-1}(\lambda) - b_{i-1}^2\,p_{i-2}(\lambda), \quad i = 2, 3, \ldots, n.
\end{cases} \qquad (1)$$

It is well known that the number of variations of sign in this sequence equals the
number of eigenvalues of $A$ which are strictly smaller than $\lambda$.

To avoid overflow problems, the sequence (1) can be modified to the form

$$\begin{cases}
q_0(\lambda) = 1 \\
q_1(\lambda) = a_1 - \lambda \\
q_i(\lambda) = p_i(\lambda)/p_{i-1}(\lambda), \quad i = 2, 3, \ldots, n
\end{cases} \qquad (2)$$

and the terms of the new sequence can be obtained by the following expressions

$$\begin{cases}
q_0(\lambda) = 1 \\
q_1(\lambda) = a_1 - \lambda \\
q_i(\lambda) = (a_i - \lambda) - b_{i-1}^2/q_{i-1}(\lambda), \quad i = 2, 3, \ldots, n
\end{cases} \qquad (3)$$

where the number of negative terms $q_i(\lambda)$, $i = 0, \ldots, n$, is equal to the
number of eigenvalues strictly smaller than $\lambda$. This is the basis for the
bisection method implemented in [18], which is known to have excellent numerical
properties in the sense that it produces very accurate eigenvalues. The drawback of
bisection is its linear convergence rate^2, which makes the method slower than others,
at least for the computation of the complete system. Different authors have proposed
modifications of the simple bisection method in order to accelerate its convergence. One
such proposal, dubbed the zeroinNR method, has been given in [2] and essentially uses
the Newton-Raphson method to find an eigenvalue after it has been isolated by bisection.
The correction $p_n(x_k)/p_n'(x_k)$, in the iterative formula of the Newton-Raphson
method,

$$x_{k+1} = x_k - \frac{p_n(x_k)}{p_n'(x_k)}, \qquad (4)$$

is obtained without explicitly calculating the values of the polynomial $p_n(x_k)$
and its derivative $p_n'(x_k)$, therefore avoiding overflow and underflow in such
computations.

^2 The bisection method converges linearly, with one bit of accuracy for each step.

From (2) we have



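$$p_i = q_i\,p_{i-1} \qquad (5)$$

and by differentiation

$$p_i' = q_i'\,p_{i-1} + q_i\,p_{i-1}' \qquad (6)$$

and carrying out the division by $p_i$, we obtain the following expression

$$\frac{p_i'}{p_i} = \frac{q_i'}{q_i} + \frac{p_{i-1}'}{p_{i-1}} \qquad (7)$$

which relates the arithmetic inverses of the Newton-Raphson correction for the
polynomials $p_{i-1}$ and $p_i$, and their quotient $q_i$.

From the recursive expression (3), we have

$$q_i' = -1 + \frac{b_{i-1}^2}{q_{i-1}^2}\,q_{i-1}', \quad i = 2, 3, \ldots, n \qquad (8)$$

and carrying out the division by $q_i$,

$$\frac{q_i'}{q_i} = \frac{1}{q_i}\left( -1 + \frac{b_{i-1}^2}{q_{i-1}}\,\frac{q_{i-1}'}{q_{i-1}} \right),
\quad i = 2, 3, \ldots, n. \qquad (9)$$

Using the notation

$$\Delta q_i = q_i'/q_i, \qquad \Delta p_i = p_i'/p_i \qquad (10)$$

the complete computation of

$$\Delta p_n = p_n'(x)/p_n(x) \qquad (11)$$

is expressed as follows.

$$\begin{aligned}
& q_1 = a_1 - x \\
& \Delta q_1 = \Delta p_1 = -1/q_1 \\
& \left. \begin{aligned}
q_i &= a_i - x - b_{i-1}^2/q_{i-1} \\
\Delta q_i &= (-1 + b_{i-1}^2/q_{i-1} \cdot \Delta q_{i-1})/q_i \\
\Delta p_i &= \Delta q_i + \Delta p_{i-1}
\end{aligned} \right\} \quad i = 2, \ldots, n
\end{aligned} \qquad (12)$$

where $\Delta q_i = q_i'(x_k)/q_i(x_k)$ and $\Delta p_i = p_i'(x_k)/p_i(x_k)$.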
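
The recurrences (12) translate almost line by line into code. The plain Python sketch
below evaluates them at a point x, returns both the number of negative terms (the
eigenvalue count used by bisection) and the quantity needed for the Newton-Raphson
correction, and then uses the correction in a short refinement loop. It is an
illustration under stated assumptions, not the zeroinNR code itself: the names
recurrence_12 and newton_refine, the stopping test and the TINY replacement for an
exactly zero pivot are choices made for this example, and a production code would
handle zero pivots more carefully.

```python
TINY = 1e-300   # assumption: replaces a pivot q_i that is exactly zero

def recurrence_12(a, b, x):
    """Evaluate (12) for diagonal a (length n) and off-diagonal b (length n-1).

    Returns (number of negative q_i, Delta p_n), where Delta p_n = p_n'(x) / p_n(x).
    """
    q = a[0] - x
    if q == 0.0:
        q = TINY
    dq = -1.0 / q                       # Delta q_1
    dp = dq                             # Delta p_1
    negatives = 1 if q < 0.0 else 0
    for i in range(1, len(a)):
        r = b[i - 1] ** 2 / q           # b^2 / q_{i-1}, uses the previous pivot
        q = a[i] - x - r                # q_i
        if q == 0.0:
            q = TINY
        dq = (-1.0 + r * dq) / q        # Delta q_i, uses Delta q_{i-1}
        dp += dq                        # Delta p_i
        if q < 0.0:
            negatives += 1
    return negatives, dp

def newton_refine(a, b, x0, tol=1e-14, max_iter=50):
    """Newton-Raphson iteration (4) with the correction computed as 1 / Delta p_n."""
    x = x0
    for _ in range(max_iter):
        _, dp = recurrence_12(a, b, x)
        step = 1.0 / dp                 # equals p_n(x) / p_n'(x)
        x -= step
        if abs(step) <= tol * max(1.0, abs(x)):
            break
    return x

if __name__ == "__main__":
    n = 5
    a = [2.0] * n                       # diagonal entries 2
    b = [1.0] * (n - 1)                 # off-diagonal entries 1
    print("eigenvalues below 2.5:", recurrence_12(a, b, 2.5)[0])
    print("largest eigenvalue   :", newton_refine(a, b, 3.8))
```

For this small matrix the largest eigenvalue is 2 + 2 cos(π/6), and starting Newton's
iteration slightly above it gives monotone convergence; in the combined method the
starting point would instead come from the bisection phase.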

It is important to observe that in the computation of $\Delta p_n$ using the formulae
(12), the values $q_i$, $i = 1, \ldots, n$, are obtained, and their signs can be used to
derive a method that combines bisection and the Newton-Raphson iteration. We will refer
to this method as the zeroinNR algorithm.

So, given an interval $[\alpha, \beta]$ which contains an eigenvalue, and given an
approximation $x_k \in [\alpha, \beta]$, the zeroinNR method will produce, in each step,
an approximation $x_{k+1}$ for the eigenvalue.

The zeroinNR method although not as fast as the QR method (according to 
[2], zeroinNR is about two to four times slower than QR for the computation of 
all eigenvalues, depending on the characteristics of the spectrum) is consistently 
faster than simple bisection (generally, twice as fast) and retains the excellent 
numerical properties of simple bisection. 

In the present work we have introduced some modifications in the original 
zeroinNR method which actually make it faster. Since, for each iteration, the 
computation expressed in (12) takes about twice as long as a simple bisection 
step, we would like to have a convergence rate better than linear as soon as we 
switch from bisection to Newton’s method. In practice, we found that it is a 
good idea to perform a few more steps of simple bisection after the isolation of 
each particular eigenvalue. This guarantees that in the first use of formulae (12)
the value of $x$ is a reasonably good initial approximation of the eigenvalue with
which to start Newton’s method.

Numerical tests were carried out in a transputer based machine using double 
precision arithmetic. The methods were implemented in Occam 2, the official 
transputer’s language. 

We were able to find out the errors in the computed eigenvalues since we 
have used matrices for which analytic expressions for the eigenvalues are known. 
We conclude that, for small matrices, the accuracy of zeroinNR is comparable 
to that of the QR method as implemented in the MatLab system [19], but as 
the size of the matrices grows, the zeroinNR method provides more accurate 
eigenvalues than QR method. 

This can be appreciated in Figure 1, where the absolute errors in the eigenvalues
of a matrix of size 1000 are plotted. We have used the tridiagonal matrix
with $a_i = 2$ and $b_i = 1$, whose eigenvalues are given by

$$\lambda_i = 2 + 2\cos\left(\frac{i\pi}{n+1}\right), \quad i = 1, \ldots, n.$$



3 An Efficient Parallel Algorithm: farmzeroinNR 

The sequential zeroinNR method can be readily adapted to parallel processing 
since several disjoint intervals can be treated simultaneously by different proces- 
sors. We have developed a parallel organization under a processor farm model 
and we will refer to this parallel implementation as the farmzeroinNR method. 
The typical architecture for this model is a pipeline of processors (workers), 
where the master sends tasks to workers and gets back the results produced. 

Fig. 1. Absolute errors of the eigenvalues of a matrix ($n = 1000$) computed with
zeroinNR and QR.

Each time a processor produces two disjoint intervals containing eigenvalues,
as the result of a bisection step, it keeps only one of them and passes back to the
master the second interval, which is kept in a queue of tasks. As soon as there
is an available worker somewhere in the line, a new task is fed into the pipeline.
Because of this mechanism, the algorithm achieves dynamic load balancing.

A dynamic distribution of tasks results from the fact already mentioned, 
that as soon as a worker finishes a task, it will get a new one from the queue 
(which is managed by the master), if such queue is not empty. The advantage of 
such dynamic workload distribution gets more important as n grows. It must be 
noted that because some tasks take longer to finish than others, workers may not 
execute the same number of tasks, but will spend about the same time working. 

The worker pseudocode is given in Algorithm 1 . 



while not receive signal to terminate do
    interval <- input channel
    if interval has more than one eigenvalue do
        intervals <- bisection method(interval)
        output <- intervals (to the master)
    else
        eig <- extract isolated eigenvalue
        output <- eig (to the master)
    endif

Algorithm 1: FarmzeroinNR worker processor pseudocode.



The pseudocode for the master processor is given in Algorithm 2.

eig <- 0
for k <- 1 .. p - 1 do
    worker[k] <- initial interval[k]
procs <- 0
while eig < n do
    case input channel is a
        interval:
            if procs > 0 do
                output <- interval to workers
                procs <- procs - 1
            else
                queue <- interval
            endif
        eigenvalue:
            eig <- eig + 1
            if queue not empty do
                output <- interval to workers
            else
                procs <- procs + 1
            endif
send signal to terminate

Algorithm 2: FarmzeroinNR master processor pseudocode.

It must be noted that messages exchanged between the master and some worker in the
pipeline need to be routed through the processors that lie in between. For the global
performance of the system it is important that messages reach their destination as
quickly as possible; therefore communication must be given priority over computation.
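
The interplay between the task queue and the workers can be imitated sequentially. The
plain Python sketch below keeps a queue of intervals as the master does and processes
each task the way a worker would: an interval with several eigenvalues is bisected, an
interval with a single eigenvalue is refined. It is a simplified illustration, not the
Occam implementation: there is no real parallelism, both halves of a bisected interval
go back to the queue (whereas a worker keeps one of them), plain bisection stands in for
the Newton-Raphson extraction, and the names count_below and farm_eigenvalues are
hypothetical.

```python
from collections import deque
import math

def count_below(a, b, x):
    """Number of eigenvalues strictly smaller than x: negative terms of sequence (3)."""
    count, q = 0, 1.0
    for i in range(len(a)):
        q = a[i] - x - (b[i - 1] ** 2 / q if i > 0 else 0.0)
        if q == 0.0:
            q = 1e-300                  # assumption: guard against a zero pivot
        if q < 0.0:
            count += 1
    return count

def farm_eigenvalues(a, b, lo, hi, tol=1e-12):
    """Compute all eigenvalues in [lo, hi) with a queue of interval tasks."""
    tasks = deque([(lo, hi, count_below(a, b, lo), count_below(a, b, hi))])
    found = []
    while tasks:                        # the master hands out the next task
        lo_k, hi_k, n_lo, n_hi = tasks.popleft()
        if n_hi - n_lo > 1 and hi_k - lo_k > tol:
            # More than one eigenvalue: one bisection step, new tasks are queued.
            mid = 0.5 * (lo_k + hi_k)
            n_mid = count_below(a, b, mid)
            if n_mid > n_lo:
                tasks.append((lo_k, mid, n_lo, n_mid))
            if n_hi > n_mid:
                tasks.append((mid, hi_k, n_mid, n_hi))
        else:
            # Isolated eigenvalue (or a cluster narrower than tol): extract it.
            while hi_k - lo_k > tol:
                mid = 0.5 * (lo_k + hi_k)
                if count_below(a, b, mid) > n_lo:
                    hi_k = mid
                else:
                    lo_k = mid
            found.extend([0.5 * (lo_k + hi_k)] * (n_hi - n_lo))
    return sorted(found)

if __name__ == "__main__":
    n = 10
    a, b = [2.0] * n, [1.0] * (n - 1)
    computed = farm_eigenvalues(a, b, 0.0, 4.0)
    exact = sorted(2.0 + 2.0 * math.cos(k * math.pi / (n + 1)) for k in range(1, n + 1))
    print(max(abs(c - e) for c, e in zip(computed, exact)))
```

Because tasks are created only when a bisection step finds eigenvalues on both sides,
the number of pending tasks adapts to the spectrum, which is the source of the dynamic
load balancing mentioned above.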

To compute eigenvectors, once we have computed (selected) eigenvalues, we
can use inverse iteration. Convergence is fast, but eigenvectors associated with
close eigenvalues may not be orthogonal. LAPACK’s routine sstein uses
reorthogonalization of such eigenvectors. This does not solve the problem when
there is a cluster with many close eigenvalues [4, pg. 231], and recent progress
on this problem appears to indicate that numerically orthogonal eigenvectors
may be computed without spending more than $O(n)$ flops per eigenvector [20].

4 Performance Analysis 

As already mentioned, a typical architecture for the processor farm model con- 
sists of a bidirectional array, forming a single pipeline (SP), with the master 
placed at one end of the array (Fig. 2). 




Fig. 2. Single pipeline, with 112 nodes. 



A Parallel Algorithm for the Symmetric Tridiagonal Eigenvalue Problem

It is predictable that as the number of workers increases, the communication
overhead becomes more significant and processors that are further away from the
master take longer to communicate with it. Furthermore, the activity in the
links of the processors which are closer to the master grows with the number of
processors, and some congestion is to be expected if the computational complexity
of each task is not sufficiently large. In an attempt to overcome the problems just
mentioned, we decided to test the parallel algorithm with a modified topology,
referred to as multiple pipeline (MP), which consists of seven pipelines, each one
with 16 transputers; the masters of such pipelines are themselves connected in
a single pipeline (Fig. 3).




Fig. 3. Multiple pipeline, with 112 nodes. 



At the beginning, the interval that contains all the eigenvalues is decom- 
posed in 7 subintervals of equal width which are distributed among the different 
pipelines. 

Although this may reduce communication overhead and waiting times to some extent,
it has an important disadvantage: a possible deterioration of the load balancing,
which becomes critical when some of the subintervals
contain a much larger number of eigenvalues than others. Therefore, the spectral
distribution of the matrix is an important factor to be considered when compa- 
ring the performance of the SP and MP architectures. For this reason we have 
used three different types of matrices (see Table 1 where ai,i = 1, . . . ,n, repre- 
sent the diagonal elements and bi,i = l,...,n — 1, represent the sub-diagonal 
elements) with different spectral distributions (see Figures 4 and 5) and sizes n 
ranging from one thousand to ten thousand. 

Matrix   Elements                                  Analytical Formula

I        a_i = a,  b_i = b                         { a + 2b cos(kπ/(n+1)) },  k = 1, ..., n

II       a_i = 0,  b_i = sqrt(i(n - i))            { -n + 2k - 1 },  k = 1, ..., n

III      b_i = i(n - i)                            { ... },  k = 1, ..., n

Table 1. Matrix Types.




Fig. 4. Spectral distributions for matrix I with n = 1000. 





Fig. 5. Spectral distributions for matrix II (left; subintervals of [-999, 999]) and
matrix III (right; subintervals of [-999000, 0]), with n = 1000.

We have computed the efficiency in the usual way, i.e.,

$$E = \frac{T_1}{112\,T_{112}},$$

where $T_1$ represents the time taken by a single transputer executing the sequential
implementation of zeroinNR, and $T_{112}$ is the time taken by farmzeroinNR
with 112 processors. In Table 2 such ratios are given, where $E(SP)$ and $E(MP)$
denote the efficiency obtained for the single pipeline and multiple pipeline
implementations, respectively.





          Matrix I           Matrix II          Matrix III
n         E(SP)    E(MP)     E(SP)    E(MP)     E(SP)    E(MP)
1000      55%      45%       61%      71%       60%      35%
5000      80%      56%       92%      89%       90%      38%
7000      93%      64%       91%      85%       94%      39%
10000     95%      56%       99%      91%       97%      39%

Table 2. Efficiency of farmzeroinNR, for matrices of type I, II and III.



As can be seen from this table, the MP implementation is less efficient than
the SP implementation, except for matrix II of size n = 1000. In general, we
have obtained better efficiency values with the SP architecture and we
conclude that, for n sufficiently large, the communication overhead is not as
important as the imbalance in the distribution of tasks introduced by the MP
architecture. This is particularly clear in the case of matrix III since, for
the larger values of n, the efficiency for SP is about 2.4 times the
efficiency for MP. The explanation can be found in Figure 5 (right side): in
the case of matrix III, the number of eigenvalues received by each of the
seven pipelines varies widely, from about 70 to about 370. Another important
aspect that must be taken into account is that in the MP implementation there
are only 105 workers, since 7 processors play the role of master. However,
even if we had used the modified formula

\[ E = \frac{T_1}{105\, T_{112}} \]

to compute the efficiency for the MP implementation, the values produced in
this way would still be lower than those obtained for the SP architecture in
most cases.
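
The effect of the modified formula can be checked directly from Table 2:
replacing 112 by 105 in the denominator multiplies each E(MP) value by
112/105. The following sketch applies that rescaling to the tabulated
percentages and counts how often MP remains below SP. The names `TABLE2`,
`rescale_mp` and `compare` are illustrative only; the input values are the
rounded percentages of Table 2, so the rescaled figures are approximate.

```python
# Efficiencies (in %) transcribed from Table 2: n -> (E(SP), E(MP)).
TABLE2 = {
    "I":   {1000: (55, 45), 5000: (80, 56), 7000: (93, 64), 10000: (95, 56)},
    "II":  {1000: (61, 71), 5000: (92, 89), 7000: (91, 85), 10000: (99, 91)},
    "III": {1000: (60, 35), 5000: (90, 38), 7000: (94, 39), 10000: (97, 39)},
}


def rescale_mp(e_mp, total=112, workers=105):
    """Convert E = T1/(112*T112) into the modified E = T1/(105*T112)."""
    return e_mp * total / workers


def compare(table):
    """Return (matrix, n, E(SP), rescaled E(MP)) for every table entry."""
    rows = []
    for matrix, by_n in table.items():
        for n, (e_sp, e_mp) in by_n.items():
            rows.append((matrix, n, e_sp, rescale_mp(e_mp)))
    return rows


rows = compare(TABLE2)
mp_still_lower = sum(1 for _, _, e_sp, e_mp_adj in rows if e_mp_adj < e_sp)
for matrix, n, e_sp, e_mp_adj in rows:
    print(f"matrix {matrix:>3}, n={n:>5}: E(SP)={e_sp}%  rescaled E(MP)={e_mp_adj:.1f}%")
print(f"rescaled MP below SP in {mp_still_lower} of {len(rows)} cases")
```

With the rounded table entries, the rescaled MP efficiency exceeds the SP
value only for matrix II with n = 1000 and n = 5000.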

5 Conclusions 

We have carried out a parallel implementation of an efficient algorithm,
dubbed zeroinNR, for the eigenvalue problem of a symmetric tridiagonal matrix,
on a distributed memory system. The sequential zeroinNR method, although not
as fast as QR, is consistently faster than simple bisection and retains the
excellent numerical properties of this method. We have numerical evidence to
support the claim that our method produces eigenvalues with smaller errors
than those produced by QR. For the parallel implementation we used a farm
model with two different topologies: a single pipeline (SP) of 112 processors
and a multiple pipeline implementation (MP) consisting of seven pipelines,
each one with 16 processors. The MP architecture reduces the communication
overhead to some extent but is not able to retain fully the excellent load
balancing of the SP implementation. The outcome of this trade-off is not
clear in advance, since it depends on the spectral distribution of each
particular matrix. We have used matrices of different types to study this
trade-off and conclude that, for sufficiently large matrices, the parallel
algorithm under the SP architecture performs better than under the MP
architecture. It must be stressed that on parallel machines where the
communication/computation ratio is larger, the MP architecture may prove to
be better in all situations. In our experiments, we found that our parallel
algorithm under the SP architecture is very efficient: for matrices of size
n = 10000 we obtained efficiency values of at least 95% in all cases tested.
Abstract. Parallel algorithms for solving nonlinear systems are studied.
Non-stationary parallel algorithms based on the Newton method are
considered. Convergence properties of these methods are studied when
the matrix in question is either monotone or an H-matrix. In order to
illustrate the behavior of these methods, we implemented these algorithms
on two distributed memory multiprocessors. The first platform is an
Ethernet network of five 120 MHz Pentiums. The second platform is an IBM
RS/6000 with 8 nodes. Several versions of these algorithms are tested.
Experiments show that these algorithms can solve the nonlinear system
in substantially less time than the current (stationary or non-stationary)
parallel nonlinear algorithms based on the multisplitting technique.



1 Introduction 

Let $F : \mathbb{R}^n \to \mathbb{R}^n$ be a nonlinear function. We are
interested in the parallel solution of the system of nonlinear equations

\[ F(x) = 0, \tag{1} \]

where it is assumed that a solution $x^*$ exists. We suppose that there exists
an $r_0 > 0$ such that

(i) $F$ is differentiable on $S_0 = \{x \in \mathbb{R}^n : \|x - x^*\| < r_0\}$,

(ii) the Jacobian matrix at $x^*$, $F'(x^*)$, is nonsingular,

(iii) there exists an $L > 0$ such that for $x \in S_0$,
$\|F'(x) - F'(x^*)\| \le L\|x - x^*\|$.

Under assumptions (i)-(iii), a well-known method for solving the nonlinear
system (1) is the classical Newton method (cf. [11]). Given an initial vector
$x^{(0)}$, this method produces the following sequence of vectors

\[ x^{(\ell+1)} = x^{(\ell)} - z^{(\ell)}, \quad \ell = 0, 1, \ldots, \tag{2} \]

where $z^{(\ell)}$ is the solution of the linear system

\[ F'(x^{(\ell)}) z = F(x^{(\ell)}). \tag{3} \]
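
As an illustration of iteration (2)-(3), the following is a minimal sketch
of the classical Newton method for a system of two equations, solving the
linear system (3) exactly by Cramer's rule. The test problem and the names
`newton`, `F` and `J` are assumptions made for this example and do not come
from the paper.

```python
def newton(F, J, x0, tol=1e-12, max_iter=50):
    """Classical Newton iteration (2)-(3) for a system of two equations."""
    x = list(x0)
    for _ in range(max_iter):
        f1, f2 = F(x)
        if max(abs(f1), abs(f2)) < tol:
            break
        (a, b), (c, d) = J(x)
        det = a * d - b * c  # assumes the Jacobian is nonsingular here
        # Solve J(x) z = F(x), i.e. linear system (3), by Cramer's rule.
        z1 = (f1 * d - b * f2) / det
        z2 = (a * f2 - c * f1) / det
        # Update (2): x <- x - z.
        x = [x[0] - z1, x[1] - z2]
    return x


# Hypothetical test problem: x0^2 + x1^2 = 4 and x0 * x1 = 1.
def F(x):
    return (x[0] ** 2 + x[1] ** 2 - 4.0, x[0] * x[1] - 1.0)


def J(x):
    return ((2.0 * x[0], 2.0 * x[1]), (x[1], x[0]))


root = newton(F, J, [2.0, 0.5])
print(root)
```

Started close to a solution, the iterates converge rapidly, in line with
the local convergence theory recalled in Section 2.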
On the other hand, if we use an iterative method to approximate the solution
of (3) we obtain a Newton iterative method; see e.g., [11] and [12]. In order
to generate efficient algorithms to solve the nonlinear system (1) on a
parallel computer, White [14] defines the parallel Newton-SOR method, which
generalizes a particular Newton iterative method, the Newton-SOR method.
In [14], White also introduces a parallel nonlinear Gauss-Seidel algorithm for
approximating the solution of an almost linear system, that is, to solve (1)
when $F(x) = Ax + \Phi(x) - b$, where $A = (a_{ij})$ is a real $n \times n$
matrix, $x$ and $b$ are n-vectors and $\Phi : \mathbb{R}^n \to \mathbb{R}^n$
is a nonlinear diagonal mapping (i.e., the i-th component $\Phi_i$ of $\Phi$
is a function only of $x_i$). Bai [2] has generalized the parallel nonlinear
Gauss-Seidel algorithm in the context of relaxed methods. Both methods are
based on the use of the multisplitting technique (see [10]). On the other
hand, Bru, Elsner and Neumann [4] studied two non-stationary methods
(synchronous and asynchronous) based on the multisplitting method for solving
linear systems in parallel. As can be seen e.g., in [6] and [9],
non-stationary algorithms behave better than the multisplitting method.
Recently, in [1] we have extended the idea of the non-stationary methods to
the problem of solving an almost linear system. These methods are a
generalization of the parallel nonlinear Gauss-Seidel algorithm [14] and the
parallel nonlinear AOR method [2].

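In this paper we construct a parallel Newton iterative algorithm to solve the
general nonlinear system (1) that uses non-stationary multisplitting models to
approximate the linear system (3). For this purpose, let us consider for each
$x$ a multisplitting of $F'(x)$, $\{M_k(x), N_k(x), E_k\}_{k=1}^{p}$, that is,
a collection of splittings

\[ F'(x) = M_k(x) - N_k(x), \quad 1 \le k \le p, \tag{4} \]

and diagonal nonnegative weighting matrices $E_k$ which add to the identity.
Let us further consider a sequence of integers $q(\ell,s,k)$,
$\ell = 0, 1, 2, \ldots$, $s = 1, 2, \ldots, m_\ell$, $1 \le k \le p$, called
non-stationary parameters. Following [4] or [9], the solution of the linear
system (3) can be approximated as follows:

\[ z^{(\ell,s)} = H_{\ell,s}(x^{(\ell)})\, z^{(\ell,s-1)} + B_{\ell,s}(x^{(\ell)})\, F(x^{(\ell)}), \quad s = 1, 2, \ldots, m_\ell, \]

\[ H_{\ell,s}(x) = \sum_{k=1}^{p} E_k \left(M_k^{-1}(x) N_k(x)\right)^{q(\ell,s,k)}, \tag{5} \]

\[ B_{\ell,s}(x) = \sum_{k=1}^{p} E_k \sum_{j=0}^{q(\ell,s,k)-1} \left(M_k^{-1}(x) N_k(x)\right)^{j} M_k^{-1}(x) \tag{6} \]

\[ \phantom{B_{\ell,s}(x)} = \sum_{k=1}^{p} E_k \left(I - \left(M_k^{-1}(x) N_k(x)\right)^{q(\ell,s,k)}\right) (F'(x))^{-1}, \tag{7} \]

and $z^{(\ell,0)} = 0$. Thus

\[ z^{(\ell,m_\ell)} = \left( \sum_{i=1}^{m_\ell - 1} \prod_{j=i+1}^{m_\ell} H_{\ell,j}(x^{(\ell)})\, B_{\ell,i}(x^{(\ell)}) + B_{\ell,m_\ell}(x^{(\ell)}) \right) F(x^{(\ell)}), \]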
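where $\prod_{j=i+1}^{m_\ell} H_{\ell,j}(x^{(\ell)})$ denotes the product of
the matrices $H_{\ell,j}(x^{(\ell)})$ in the order
$H_{\ell,m_\ell}(x^{(\ell)}) \cdots H_{\ell,i+1}(x^{(\ell)})$. Therefore, from
(2) the non-stationary parallel Newton iterative method can be written as
follows:

\[ x^{(\ell+1)} = G_{\ell,m_\ell}(x^{(\ell)}), \quad \ell = 0, 1, \ldots, \tag{8} \]

where

\[ G_{\ell,m_\ell}(x) = x - A_{\ell,m_\ell}(x) F(x) \]

and

\[ A_{\ell,m_\ell}(x) = \sum_{i=1}^{m_\ell - 1} \prod_{j=i+1}^{m_\ell} H_{\ell,j}(x)\, B_{\ell,i}(x) + B_{\ell,m_\ell}(x). \tag{9} \]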

We note that the formulation of this method allows us to use a different
number of local iterations $q(\ell,s,k)$ not only in each processor $k$ and at
each nonlinear iteration $\ell$, but also at each linear iteration $s$.
Moreover, this method extends the parallel Newton method introduced by
White [14].
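
To make the structure of the method concrete, the following is a minimal
sequential sketch for an almost linear system $F(x) = Ax + \Phi(x) - b$.
Several choices are assumptions made only for illustration: a small
tridiagonal test matrix, $\Phi_i(t) = e^t$, two non-overlapping index
blocks, a Gauss-Seidel type splitting inside each block, and constant
parameters $q(\ell,s,k) = q$ and $m_\ell = m$. The loop over blocks emulates
the p processors one after another; it is not the authors' PVM code.

```python
import math


def nonstationary_newton(A, b, phi, dphi, blocks, q=3, m=2,
                         tol=1e-10, max_outer=100):
    """Sketch of a non-stationary Newton multisplitting iteration."""
    n = len(b)
    x = [1.0] * n
    for outer in range(max_outer):
        Fx = [sum(A[i][j] * x[j] for j in range(n)) + phi(x[i]) - b[i]
              for i in range(n)]
        if max(abs(v) for v in Fx) < tol:
            return x, outer
        # Jacobian F'(x) = A + diag(phi'(x_i)).
        Jac = [[A[i][j] + (dphi(x[i]) if i == j else 0.0) for j in range(n)]
               for i in range(n)]
        z = [0.0] * n                       # z^(l,0) = 0
        for _s in range(m):                 # m linear iterations
            z_next = [0.0] * n
            for blk in blocks:              # one "processor" per block
                y = list(z)
                for _local in range(q):     # q local iterations
                    # y <- y - M_k^{-1} (F'(x) y - F(x)), with
                    # M_k = diagonal plus strictly lower part inside blk.
                    r = [sum(Jac[i][j] * y[j] for j in range(n)) - Fx[i]
                         for i in range(n)]
                    d = [0.0] * n
                    for i in range(n):      # forward substitution
                        acc = r[i]
                        if i in blk:
                            acc -= sum(Jac[i][j] * d[j] for j in blk if j < i)
                        d[i] = acc / Jac[i][i]
                    y = [y[i] - d[i] for i in range(n)]
                for i in blk:               # weighting by E_k
                    z_next[i] = y[i]
            z = z_next
        x = [x[i] - z[i] for i in range(n)]  # Newton update (2)
    return x, max_outer


# Hypothetical test problem with known solution x* = (0.5, ..., 0.5).
n = 6
A = [[4.0 if i == j else (-1.0 if abs(i - j) == 1 else 0.0)
      for j in range(n)] for i in range(n)]
x_star = [0.5] * n
b = [sum(A[i][j] * x_star[j] for j in range(n)) + math.exp(x_star[i])
     for i in range(n)]
solution, n_outer = nonstationary_newton(
    A, b, math.exp, math.exp, [range(0, 3), range(3, 6)])
print(n_outer, solution)
```

Varying `q` and `m` in this sketch changes how accurately each Newton
correction is approximated, which is the trade-off the experiments of
Section 3 explore.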

In the following section we analyze the convergence properties of this
algorithm when the Jacobian matrix is monotone or an H-matrix. Section 3
contains some numerical experiments, which illustrate the performance of the
algorithms studied, on an Ethernet network of five 120 MHz Pentiums and on an
IBM RS/6000 SP. In the rest of this section we present some notation,
definitions and preliminary results used in the paper.

A matrix $A$ is said to be a nonsingular M-matrix if $A$ has all nonpositive
off-diagonal entries and it is monotone, i.e., $A^{-1} \ge O$. For any matrix
$A = (a_{ij}) \in \mathbb{R}^{n \times n}$, we define its comparison matrix
$\langle A \rangle = (\alpha_{ij})$ by $\alpha_{ii} = |a_{ii}|$,
$\alpha_{ij} = -|a_{ij}|$, $i \ne j$. The matrix $A$ is said to be an H-matrix
if $\langle A \rangle$ is a nonsingular M-matrix. The splitting $A = M - N$ is
called a weak regular splitting if $M^{-1} \ge O$ and $M^{-1}N \ge O$; the
splitting is an H-compatible splitting if
$\langle A \rangle = \langle M \rangle - |N|$;
see e.g., Berman and Plemmons [3] or Varga [13].
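
These definitions translate directly into a small numerical check. The
sketch below builds the comparison matrix and tests the H-matrix property by
inverting it with Gauss-Jordan elimination and checking that the inverse is
nonnegative. The function names and tolerances are assumptions for this
example, and the approach is only sensible for small dense matrices.

```python
def comparison_matrix(A):
    """<A>: |a_ii| on the diagonal, -|a_ij| off the diagonal."""
    n = len(A)
    return [[abs(A[i][j]) if i == j else -abs(A[i][j]) for j in range(n)]
            for i in range(n)]


def inverse(A, eps=1e-12):
    """Gauss-Jordan inverse with partial pivoting; None if singular."""
    n = len(A)
    aug = [[float(v) for v in A[i]] + [1.0 if i == j else 0.0 for j in range(n)]
           for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[piv][col]) < eps:
            return None
        aug[col], aug[piv] = aug[piv], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def is_h_matrix(A, tol=1e-12):
    """True if <A> is nonsingular with a nonnegative inverse."""
    inv = inverse(comparison_matrix(A))
    return inv is not None and all(v >= -tol for row in inv for v in row)


print(is_h_matrix([[4, -1, 0], [1, -4, 1], [0, -1, 4]]))  # diagonally dominant
print(is_h_matrix([[1, 2], [2, 1]]))                      # not an H-matrix
```

The comparison matrix always has nonpositive off-diagonal entries, so only
the monotonicity condition needs to be tested.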

A sequence $\{x^{(\ell)}\}$ converges Q-quadratically to $x^*$ if there exists
$c < +\infty$ such that

\[ \|x^{(\ell+1)} - x^*\| \le c\, \|x^{(\ell)} - x^*\|^2. \]

$L(\mathbb{R}^n)$ denotes the linear space of linear operators from
$\mathbb{R}^n$ to $\mathbb{R}^n$.

Lemma 1. Suppose that the mapping
$A : D \subset \mathbb{R}^n \to L(\mathbb{R}^n)$ is continuous at a point
$x^0 \in D$ for which $A(x^0)$ is nonsingular. Then there is a $\delta > 0$
and a $\beta > 0$ so that, for any
$x \in D \cap \{x : \|x - x^0\| \le \delta\}$, $A(x)$ is nonsingular and
$\|A(x)^{-1}\| \le \beta$. Moreover, $A(x)^{-1}$ is continuous in $x$ at
$x^0$.

Proof. See Ortega and Rheinboldt [11].

Theorem 1. Suppose $F : D \subset \mathbb{R}^n \to \mathbb{R}^m$ is
G-differentiable at each point of a convex set $D_0 \subset D$. Then, for any
$x, y, z \in D_0$,

\[ \|F(y) - F(z) - F'(x)(y - z)\| \le \|y - z\| \sup_{0 \le t \le 1} \|F'(z + t(y - z)) - F'(x)\|. \]

Proof. See Ortega and Rheinboldt [11].
2 Convergence

In this section we study the convergence of the iterative scheme (8). For this
purpose we need to make the following additional assumptions on the
splittings (4).

(iv) There exist $t_k > 0$, $1 \le k \le p$, such that for $x \in S_0$,
$\|M_k(x) - M_k(x^*)\| \le t_k \|x - x^*\|$.

(v) $M_k(x^*)$, $1 \le k \le p$, are nonsingular.

(vi) There exists $0 \le \alpha < 1$ such that, for each positive integer $s$
and $\ell = 0, 1, \ldots$,

\[ \|H_{\ell,s}(x^*)\| \le \alpha, \]

where $H_{\ell,s}(x^*)$ is defined in (5).

From assumptions (i)-(iii) of Section 1 and using Lemma 1, it follows that
there exists $0 < r_1 < r_0$ such that $F'$ is continuous and nonsingular in
$S_1 = \{x \in \mathbb{R}^n : \|x - x^*\| < r_1\}$. On the other hand, it can
be shown (see e.g., Ortega and Rheinboldt [11]) that the Newton method (2)
converges Q-quadratically to $x^*$ in a neighborhood of $x^*$. In order to
simplify the notation we also denote this neighborhood by $S_1$. From
assumptions (iv)-(v) and Lemma 1, it follows that $M_k$, $1 \le k \le p$, is
continuous and nonsingular in a neighborhood of $x^*$, say again $S_1$.
Therefore $M_k(x)^{-1} N_k(x)$, $1 \le k \le p$, is well defined and moreover
continuous in $S_1$. Then, $H_{\ell,s}(x)$ is also continuous in $S_1$. Now,
from assumption (vi) we obtain that $\|H_{\ell,s}(x)\| \le \alpha$,
$\ell = 0, 1, \ldots$, $s = 1, 2, \ldots, m_\ell$, in a neighborhood of $x^*$,
denoted again by $S_1$. Moreover, since $M_k(x)^{-1} N_k(x)$,
$1 \le k \le p$, are continuous in $S_1$, there exists a positive constant $K$
such that

\[ \|M_k(x)^{-1} N_k(x)\| \le K, \quad 1 \le k \le p, \tag{10} \]

for all $x$ in a neighborhood of $x^*$, that we denote again by $S_1$.

Lemma 2. Let $A : \mathbb{R}^n \to L(\mathbb{R}^n)$ be a mapping such that
$\|A(x)\| \le \delta$ in a neighborhood $S$ of $x^*$. Then for any $x \in S$
and for any positive integer $m$

\[ \|A(x)^m - A(x^*)^m\| \le m\, \delta^{m-1} \|A(x) - A(x^*)\|. \]

Proof. We proceed by induction. For $m = 1$, the result follows obviously.
Suppose that the result is true for $m = k$. Then

\[
\begin{aligned}
\|A(x)^{k+1} - A(x^*)^{k+1}\|
&= \|A(x)^{k+1} - A(x)^k A(x^*) + A(x)^k A(x^*) - A(x^*)^{k+1}\| \\
&\le \|A(x)^k (A(x) - A(x^*))\| + \|(A(x)^k - A(x^*)^k) A(x^*)\| \\
&\le \delta^k \|A(x) - A(x^*)\| + k\, \delta^{k-1} \|A(x) - A(x^*)\|\, \delta
 = (k+1)\, \delta^k \|A(x) - A(x^*)\|,
\end{aligned}
\]

and the proof is complete.
Lemma 3. Let $x \in S_1$ and let $m$ be a positive integer. Then

\[ I - \prod_{i=1}^{m} H_{\ell,i}(x) = A_{\ell,m}(x) F'(x), \quad \ell = 0, 1, \ldots. \]

Proof. Let $x \in S_1$. Then from (7)

\[ B_{\ell,s}(x) F'(x) = I - H_{\ell,s}(x), \quad s = 1, 2, \ldots, m, \quad \ell = 0, 1, \ldots, \tag{11} \]

where $H_{\ell,s}(x)$ and $B_{\ell,s}(x)$ are defined in (5) and (6)
respectively. Then from (9) and (11) we obtain

\[
\begin{aligned}
A_{\ell,m}(x) F'(x)
&= \sum_{i=1}^{m-1} \prod_{j=i+1}^{m} H_{\ell,j}(x) \left(I - H_{\ell,i}(x)\right) + \left(I - H_{\ell,m}(x)\right) \\
&= \sum_{i=1}^{m-1} \left( \prod_{j=i+1}^{m} H_{\ell,j}(x) - \prod_{j=i}^{m} H_{\ell,j}(x) \right) + \left(I - H_{\ell,m}(x)\right) \\
&= I - \prod_{i=1}^{m} H_{\ell,i}(x),
\end{aligned}
\]

and the proof is done.

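Lemma 4. Let $x \in S_1$ and let $m$ be a positive integer. Then

\[ \Big\| \prod_{i=1}^{m} H_{\ell,i}(x) - \prod_{i=1}^{m} H_{\ell,i}(x^*) \Big\| \le \alpha^{m-1} \sum_{i=1}^{m} \|H_{\ell,i}(x) - H_{\ell,i}(x^*)\|, \quad \ell = 0, 1, \ldots. \]

Proof. In order to show this result, we proceed by induction. Obviously, the
result follows for $m = 1$. Suppose that the result is true for $m = k$. Then,
taking into account that $\|H_{\ell,s}(x)\| \le \alpha$, we can write

\[
\begin{aligned}
\Big\| \prod_{j=1}^{k+1} H_{\ell,j}(x) - \prod_{j=1}^{k+1} H_{\ell,j}(x^*) \Big\|
&= \Big\| \prod_{j=1}^{k+1} H_{\ell,j}(x) - H_{\ell,k+1}(x) \prod_{j=1}^{k} H_{\ell,j}(x^*)
   + H_{\ell,k+1}(x) \prod_{j=1}^{k} H_{\ell,j}(x^*) - \prod_{j=1}^{k+1} H_{\ell,j}(x^*) \Big\| \\
&\le \|H_{\ell,k+1}(x)\| \Big\| \prod_{j=1}^{k} H_{\ell,j}(x) - \prod_{j=1}^{k} H_{\ell,j}(x^*) \Big\|
   + \|H_{\ell,k+1}(x) - H_{\ell,k+1}(x^*)\| \Big\| \prod_{j=1}^{k} H_{\ell,j}(x^*) \Big\| \\
&\le \alpha \Big\| \prod_{j=1}^{k} H_{\ell,j}(x) - \prod_{j=1}^{k} H_{\ell,j}(x^*) \Big\|
   + \|H_{\ell,k+1}(x) - H_{\ell,k+1}(x^*)\|\, \alpha^{k} \\
&\le \alpha\, \alpha^{k-1} \sum_{j=1}^{k} \|H_{\ell,j}(x) - H_{\ell,j}(x^*)\|
   + \alpha^{k} \|H_{\ell,k+1}(x) - H_{\ell,k+1}(x^*)\| \\
&= \alpha^{k} \sum_{j=1}^{k+1} \|H_{\ell,j}(x) - H_{\ell,j}(x^*)\|,
\end{aligned}
\]

and the proof is complete.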



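Lemma 5. Suppose assumptions (i)-(v) are satisfied. Assume further that the
sequence of numbers of local iterations $q(\ell,s,k)$, $\ell = 0, 1, \ldots$,
$s = 1, 2, \ldots, m_\ell$, $1 \le k \le p$, remains bounded by
$\tilde{q} > 0$. Then, there exists $L^* > 0$ such that, for any $x \in S_1$
and for any positive integer $s$,

\[ \|H_{\ell,s}(x) - H_{\ell,s}(x^*)\| \le L^* \|x - x^*\|, \quad \ell = 0, 1, \ldots. \]

Proof. Let $x \in S_1$. From (iii), (iv) and (v) it is known (see e.g., [12])
that there exist $\tau_k > 0$, $1 \le k \le p$, such that

\[ \|M_k^{-1}(x) N_k(x) - M_k^{-1}(x^*) N_k(x^*)\| \le \tau_k \|x - x^*\|. \tag{12} \]

On the other hand, if we denote $R_k(x) = M_k^{-1}(x) N_k(x)$, using Lemma 2
and (10), we obtain

\[ \|R_k(x)^{q(\ell,s,k)} - R_k(x^*)^{q(\ell,s,k)}\| \le q(\ell,s,k)\, K^{q(\ell,s,k)-1} \|R_k(x) - R_k(x^*)\|. \tag{13} \]

Therefore from (10), (12) and (13), we have

\[
\begin{aligned}
\|H_{\ell,s}(x) - H_{\ell,s}(x^*)\|
&\le \sum_{k=1}^{p} \|E_k\|\, \|R_k(x)^{q(\ell,s,k)} - R_k(x^*)^{q(\ell,s,k)}\| \\
&\le \sum_{k=1}^{p} \|E_k\|\, q(\ell,s,k)\, K^{q(\ell,s,k)-1} \tau_k \|x - x^*\|
 \le \sum_{k=1}^{p} \|E_k\| \left(\tilde{q}\, K'^{\,\tilde{q}-1} \tau_k\right) \|x - x^*\|,
\end{aligned} \tag{14}
\]

with $K' = \max\{1, K\}$. Then
$\|H_{\ell,s}(x) - H_{\ell,s}(x^*)\| \le L^* \|x - x^*\|$, with
$L^* = \sum_{k=1}^{p} \|E_k\| \left(\tilde{q}\, K'^{\,\tilde{q}-1} \tau_k\right)$.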



Lemma 6. Let assumptions (i)-(vi) hold and suppose that the sequence of
numbers of local iterations $q(\ell,s,k)$, $\ell = 0, 1, \ldots$,
$s = 1, 2, \ldots, m_\ell$, $1 \le k \le p$, remains bounded by
$\tilde{q} > 0$. Then there exists $c_1 < +\infty$ such that, for any
$x \in S_1$ and any positive integer $m$,

\[ \|G_{\ell,m}(x) - x^*\| \le c_1 \|x - x^*\|^2 + \alpha^m \|x - x^*\|, \quad \ell = 0, 1, \ldots. \]

Proof. From Lemma 3 it follows that

\[
\begin{aligned}
\|G_{\ell,m}(x) - x^*\| &= \|x - A_{\ell,m}(x) F(x) - x^*\| \\
&\le \Big\| -A_{\ell,m}(x) F(x) + \Big(I - \prod_{i=1}^{m} H_{\ell,i}(x^*)\Big)(x - x^*) \Big\|
   + \Big\| \prod_{i=1}^{m} H_{\ell,i}(x^*) (x - x^*) \Big\| \\
&= \| -A_{\ell,m}(x) F(x) + A_{\ell,m}(x^*) F'(x^*)(x - x^*) \|
   + \Big\| \prod_{i=1}^{m} H_{\ell,i}(x^*) (x - x^*) \Big\|.
\end{aligned}
\]

Then by assumption (vi), for $\ell = 0, 1, \ldots$, we obtain

\[ \|G_{\ell,m}(x) - x^*\| \le \| -A_{\ell,m}(x) F(x) + A_{\ell,m}(x^*) F'(x^*)(x - x^*) \| + \alpha^m \|x - x^*\|. \]

Now, since $F(x^*) = 0$, we have the following inequality

\[
\begin{aligned}
\|G_{\ell,m}(x) - x^*\| \le{}& \| -A_{\ell,m}(x) \left(F(x) - F(x^*) - F'(x^*)(x - x^*)\right) \| \\
&+ \| A_{\ell,m}(x) \left(F'(x) - F'(x^*)\right)(x - x^*) \| \\
&+ \| \left(A_{\ell,m}(x) F'(x) - A_{\ell,m}(x^*) F'(x^*)\right)(x - x^*) \| \\
&+ \alpha^m \|x - x^*\|.
\end{aligned} \tag{15}
\]

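On the other hand, from (9) and using assumption (vi), we have

\[
\begin{aligned}
\|A_{\ell,m}(x)\|
&= \Big\| \sum_{i=1}^{m-1} \prod_{j=i+1}^{m} H_{\ell,j}(x)\, B_{\ell,i}(x) + B_{\ell,m}(x) \Big\| \\
&\le \sum_{i=1}^{m-1} \Big\| \prod_{j=i+1}^{m} H_{\ell,j}(x) \Big\|\, \|B_{\ell,i}(x)\| + \|B_{\ell,m}(x)\| \\
&\le \sum_{i=1}^{m-1} \alpha^{m-i} \|B_{\ell,i}(x)\| + \|B_{\ell,m}(x)\|.
\end{aligned} \tag{16}
\]

By the definition of $B_{\ell,s}(x)$ given by (6), and using (10), we obtain

\[
\begin{aligned}
\|B_{\ell,s}(x)\|
&\le \sum_{k=1}^{p} \|E_k\| \sum_{h=0}^{q(\ell,s,k)-1} \| (M_k^{-1}(x) N_k(x))^h \|\, \|M_k^{-1}(x)\| \\
&\le \sum_{k=1}^{p} \|E_k\| \sum_{h=0}^{q(\ell,s,k)-1} K^h \|M_k^{-1}(x)\|.
\end{aligned}
\]

That is, since the sequence $q(\ell,s,k)$, $\ell = 0, 1, \ldots$,
$s = 1, 2, \ldots, m_\ell$, $1 \le k \le p$, remains bounded by
$\tilde{q} > 0$, we have

\[ \|B_{\ell,s}(x)\| \le \sum_{k=1}^{p} \|E_k\| \sum_{h=0}^{\tilde{q}-1} K^h \|M_k^{-1}(x)\|. \tag{17} \]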
Let $\beta = \max\{\beta_1, \beta_2, \ldots, \beta_p\}$, where
$\beta_k = \sup\{\|M_k(x)^{-1}\| : x \in S_1\}$. The existence of $\beta_k$,
$1 \le k \le p$, follows from Lemma 1. Then, from (17) we obtain

\[ \|B_{\ell,s}(x)\| \le \beta \sum_{k=1}^{p} \|E_k\| \sum_{h=0}^{\tilde{q}-1} K^h. \]

Thus,

\[ \|B_{\ell,s}(x)\| \le K^*, \quad \ell = 0, 1, \ldots, \quad s = 1, 2, \ldots, m_\ell, \tag{18} \]

for all $x \in S_1$, where
$K^* = \beta \sum_{k=1}^{p} \|E_k\| \sum_{h=0}^{\tilde{q}-1} K^h > 0$.

Now, by (16) and (18), for any $x \in S_1$ and for any positive integer $m$,
we have

\[ \|A_{\ell,m}(x)\| \le \sum_{i=1}^{m-1} \alpha^{m-i} K^* + K^* \le \left( \frac{\alpha}{1 - \alpha} + 1 \right) K^* = \bar{K}^*. \tag{19} \]

Now, using (19) we bound (15). From Theorem 1, it follows that

\[
\begin{aligned}
\| -A_{\ell,m}(x) \left(F(x) - F(x^*) - F'(x^*)(x - x^*)\right) \|
&\le \|A_{\ell,m}(x)\|\, \|x - x^*\| \sup_{0 \le t \le 1} \|F'(x^* + t(x - x^*)) - F'(x^*)\| \\
&\le \|A_{\ell,m}(x)\|\, \|x - x^*\| \sup_{0 \le t \le 1} L \|t(x - x^*)\|
 \le \bar{K}^* L \|x - x^*\|^2.
\end{aligned}
\]

On the other hand, by condition (iii) we have

\[ \|A_{\ell,m}(x) \left(F'(x) - F'(x^*)\right)(x - x^*)\| \le \|A_{\ell,m}(x)\|\, \|F'(x) - F'(x^*)\|\, \|x - x^*\| \le \bar{K}^* L \|x - x^*\|^2. \]

Using Lemmata 3, 4 and 5 we obtain

\[
\begin{aligned}
\| \left(A_{\ell,m}(x) F'(x) - A_{\ell,m}(x^*) F'(x^*)\right)(x - x^*) \|
&= \Big\| \Big( \prod_{i=1}^{m} H_{\ell,i}(x) - \prod_{i=1}^{m} H_{\ell,i}(x^*) \Big)(x - x^*) \Big\| \\
&\le \alpha^{m-1} \sum_{i=1}^{m} \|H_{\ell,i}(x) - H_{\ell,i}(x^*)\|\, \|x - x^*\|
 \le m\, \alpha^{m-1} L^* \|x - x^*\|^2.
\end{aligned}
\]

Since $\alpha < 1$, the set $\{m\, \alpha^{m-1}\}$ is bounded above. Let $c_2$
(dependent on $\alpha$) be an upper bound of this set. Then, setting
$c_1 = 2\bar{K}^* L + c_2 L^*$, the proof is complete.

Remark 1. We want to point out that, since we know nothing about the bound
$K$ in (10), we need, in Lemmata 5 and 6, the sequence $q(\ell,s,k)$ to be
bounded by $\tilde{q} > 0$. If we have $K < 1$, then we do not need that upper
bound for the non-stationary parameters $q(\ell,s,k)$ (see (14) and (17)).
Theorem 2. Let assumptions (i)-(vi) hold and $F(x^*) = 0$. Let
$\{m_\ell\}_{\ell=0}^{\infty}$ be a sequence of positive integers, and define

\[ m = \max\Big\{ \{m_0\} \cup \Big\{ m_\ell - \sum_{i=0}^{\ell-1} m_i : \ell = 1, 2, \ldots \Big\} \Big\}. \tag{20} \]

Suppose that $m < +\infty$ and that the sequence of non-stationary parameters
$q(\ell,s,k)$, $\ell = 0, 1, \ldots$, $s = 1, 2, \ldots, m_\ell$,
$1 \le k \le p$, is bounded by $\tilde{q} > 0$. Then, there exist $r > 0$ and
$c < 1$ such that, for
$x^{(0)} \in S = \{x \in \mathbb{R}^n : \|x - x^*\| < r\}$, the sequence of
iterates defined by (8) converges to $x^*$ and satisfies

\[ \|x^{(\ell+1)} - x^*\| \le c^{m_\ell} \|x^{(\ell)} - x^*\|. \]

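Proof. Let $c_1$ be as in Lemma 6. Let $c > 0$ be such that
$\alpha^{1/m} < c < 1$. Since $\alpha < c^m$, there exists $0 < r < r_1$ such
that

\[ c_1 r + \alpha < c^m, \]

and then,

\[ (c_1 r + \alpha)^{1/m} < c < 1. \]

Now, we proceed by induction. For $\ell = 1$, using Lemma 6, we have

\[
\|x^{(1)} - x^*\| = \|G_{0,m_0}(x^{(0)}) - x^*\|
\le c_1 \|x^{(0)} - x^*\|^2 + \alpha^{m_0} \|x^{(0)} - x^*\|
\le (c_1 r + \alpha^{m_0}) \|x^{(0)} - x^*\|.
\]

Since $c_1 r + \alpha^{m_0} \le c_1 r + \alpha < c^m \le c^{m_0}$, then
$\|x^{(1)} - x^*\| \le c^{m_0} \|x^{(0)} - x^*\|$. Therefore the result
follows for $\ell = 1$. Suppose that the result is true for
$0 < \ell \le j$. Then

\[ \|x^{(j)} - x^*\| \le c^{m_{j-1}} \|x^{(j-1)} - x^*\| \le \prod_{s=0}^{j-1} c^{m_s} \|x^{(0)} - x^*\|. \]

Now, for $\ell = j + 1$, from Lemma 6 it follows that

\[
\begin{aligned}
\|x^{(j+1)} - x^*\| &= \|G_{j,m_j}(x^{(j)}) - x^*\|
 \le \left(c_1 \|x^{(j)} - x^*\| + \alpha^{m_j}\right) \|x^{(j)} - x^*\| \\
&\le \Big(c_1 \Big(\prod_{s=0}^{j-1} c^{m_s}\Big) \|x^{(0)} - x^*\| + \alpha^{m_j}\Big) \|x^{(j)} - x^*\| \\
&\le \Big(c_1 r \prod_{s=0}^{j-1} c^{m_s} + \alpha^{m_j}\Big) \|x^{(j)} - x^*\|
 \le \left(c_1 r\, c^{m_j - m} + \alpha^{m_j}\right) \|x^{(j)} - x^*\| \\
&\le \left((c^m - \alpha)\, c^{m_j - m} + \alpha^{m_j}\right) \|x^{(j)} - x^*\| \\
&= c^{m_j} \left((c^m - \alpha)\, c^{-m} + \alpha^{m_j} c^{-m_j}\right) \|x^{(j)} - x^*\| \\
&= c^{m_j} \left(1 - \alpha c^{-m} + \alpha^{m_j} c^{-m_j}\right) \|x^{(j)} - x^*\| \\
&= c^{m_j} \left(1 + \alpha c^{-m} \left(\alpha^{m_j - 1} c^{m - m_j} - 1\right)\right) \|x^{(j)} - x^*\|,
\end{aligned}
\]

where the bound $\prod_{s=0}^{j-1} c^{m_s} \le c^{m_j - m}$ follows from the
definition (20) of $m$. Since $0 < \alpha < c^m < 1$, then
$\alpha c^{-m} < 1$. On the other hand,
$0 < \alpha^{m_j - 1} c^{m - m_j} \le 1$, so that
$\alpha c^{-m} \left(\alpha^{m_j - 1} c^{m - m_j} - 1\right) \le 0$.
Therefore, $\|x^{(j+1)} - x^*\| \le c^{m_j} \|x^{(j)} - x^*\|$, and the proof
is complete.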




Non-stationary Parallel Newton Iterative Methods for Nonlinear Problems 






Theorem 3. Let assumptions (i)-(iv) hold and $F(x^*) = 0$. Let
$\{m_\ell\}_{\ell=0}^{\infty}$ be a sequence of positive integers, and define
$m$ as in (20). Suppose that $m < +\infty$. If either of the following two
conditions is satisfied

1. $F'(x^*)$ is a monotone matrix and $F'(x^*) = M_k(x^*) - N_k(x^*)$,
$1 \le k \le p$, are weak regular splittings,

2. $F'(x^*)$ is an H-matrix and $F'(x^*) = M_k(x^*) - N_k(x^*)$,
$1 \le k \le p$, are H-compatible splittings,

then there exist $r > 0$ and $c < 1$ such that, for
$x^{(0)} \in S = \{x \in \mathbb{R}^n : \|x - x^*\| < r\}$, the sequence of
iterates defined by (8) converges to $x^*$ and satisfies

\[ \|x^{(\ell+1)} - x^*\| \le c^{m_\ell} \|x^{(\ell)} - x^*\|. \]

Proof. Under conditions 1 and 2, and taking into account, respectively, the
proofs of Theorem 2.1 of [4] and Theorem 3.1 of [9], we obtain that
assumptions (v), (vi) and (10) with $K < 1$ are satisfied. Then, the proofs
follow from Theorem 2 and Remark 1.



3 Numerical Experiments 

We have implemented the above method on two distributed multiprocessors. 
The first platform is an IBM RS/6000 SP with 8 nodes. The second platform is 
an Ethernet network of five 120 MHz Pentiums. In order to manage the parallel 
environment we have used the PVMe library of parallel routines for the IBM 
RS/6000 SP and the PVM library for the cluster of Pentiums [7], [8]. 

In order to illustrate the behavior of the above algorithms, we have
considered the following semilinear elliptic partial differential equation
(see e.g., [5], [12], [14])

\[
\begin{aligned}
-(K^1 u_x)_x - (K^2 u_y)_y &= -g e^u, & (x, y) &\in \Omega, \\
u &= x^2 + y^2, & (x, y) &\in \partial\Omega,
\end{aligned} \tag{21}
\]

where

\[
\begin{aligned}
K^1 &= K^1(x, y) = 1 + x^2 + y^2, \\
K^2 &= K^2(x, y) = 1 + e^x + e^y, \\
g &= g(x, y) = 2\left(2 + 3x^2 + y^2 + e^x + (1 + y)e^y\right) e^{-x^2 - y^2}, \\
\Omega &= (0, 1) \times (0, 1).
\end{aligned}
\]

It is well known that this problem has the unique solution
$u(x, y) = x^2 + y^2$. To solve equation (21) using the finite difference
method, we consider a grid in $\Omega$ of $d^2$ nodes equally spaced by
$h = \Delta x = \Delta y = 1/(d + 1)$. This discretization yields a nonlinear
system of the form $Ax + \Phi(x) = b$, where
$\Phi : \mathbb{R}^n \to \mathbb{R}^n$ is a nonlinear diagonal mapping and
$A$ is a block tridiagonal symmetric matrix
$A = \mathrm{tridiag}(D_{i-1}, T_i, D_i)$, where $T_i$ are tridiagonal
matrices of size $d \times d$, $i = 1, 2, \ldots, d$, and $D_i$ are
$d \times d$ diagonal matrices, $i = 1, \ldots, d - 1$; see e.g., [5]. Let
$S = \{1, 2, \ldots, n\}$ and let $S_k$, $k = 1, 2, \ldots, p$, be subsets of
$S$ such that $S = \bigcup_{k=1}^{p} S_k$.
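
The stated solution of problem (21) can be verified numerically. The sketch
below evaluates the residual of the differential equation at interior points
with nested central differences, using the coefficient functions given
above. The step size, the sample points and the function names are
assumptions for this check only; it is unrelated to the discretization used
in the experiments.

```python
import math


def u(x, y):
    return x * x + y * y


def K1(x, y):
    return 1.0 + x * x + y * y


def K2(x, y):
    return 1.0 + math.exp(x) + math.exp(y)


def g(x, y):
    return (2.0 * (2.0 + 3.0 * x * x + y * y + math.exp(x)
                   + (1.0 + y) * math.exp(y)) * math.exp(-x * x - y * y))


def residual(x, y, h=1e-3):
    """Residual of -(K1 u_x)_x - (K2 u_y)_y + g e^u at (x, y)."""
    def flux_x(xm):
        # K1 * u_x at (xm, y), with u_x from a central difference.
        return K1(xm, y) * (u(xm + h / 2, y) - u(xm - h / 2, y)) / h

    def flux_y(ym):
        # K2 * u_y at (x, ym), with u_y from a central difference.
        return K2(x, ym) * (u(x, ym + h / 2) - u(x, ym - h / 2)) / h

    div = ((flux_x(x + h / 2) - flux_x(x - h / 2)) / h
           + (flux_y(y + h / 2) - flux_y(y - h / 2)) / h)
    return -div + g(x, y) * math.exp(u(x, y))


for point in [(0.3, 0.7), (0.5, 0.5), (0.9, 0.1)]:
    print(point, residual(*point))
```

The residuals are of the size of the finite-difference truncation error,
which is consistent with $u(x, y) = x^2 + y^2$ satisfying the equation.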
Let us further consider a multisplitting of $F'(x)$, where
$F(x) = Ax + \Phi(x) - b$, of the form

\[ \{D(x) - L_k,\ U_k,\ E_k\}_{k=1}^{p}, \quad \text{where } L_k = (l^{(k)}_{ij}), \quad
l^{(k)}_{ij} = \begin{cases} -a_{ij}, & j < i \text{ and } i, j \in S_k, \\ 0, & \text{otherwise}, \end{cases} \]

with

\[ S_k = \Big\{ 1 + \sum_{i<k} n_i, \ldots, \sum_{i=1}^{k} n_i \Big\}, \quad 1 \le k \le p, \quad \sum_{k=1}^{p} n_k = n, \quad n_k > 0, \]

\[ D(x) = \mathrm{diag}(A) + \mathrm{diag}(\Phi_1'(x_1), \ldots, \Phi_n'(x_n)), \]

and the $n \times n$ nonnegative diagonal matrices $E_k$, $1 \le k \le p$, are
defined such that their i-th diagonal entry is null if $i \notin S_k$. Note
that this multisplitting is a Gauss-Seidel type multisplitting. The stopping
criterion used was $\|x^{(\ell)} - v\|_2 < h^2$, where $\|\cdot\|_2$ is the
Euclidean norm and $v$ is the vector whose entries are the values of the exact
solution of (21) on the nodes $(ih, jh)$, $i, j = 1, \ldots, d$, and the
initial vector was $x^{(0)} = (1, \ldots, 1)^T$. All times are reported in
seconds.

We have run our codes with matrices of various sizes and different
multisplittings depending on the number of processors used ($p$) and the
choice of the values $n_k$, $1 \le k \le p$, but to focus our discussion, we
present here results obtained with $d = 64$, which yields a nonlinear system
of size 4096. The conclusions we present here can be considered as
representative of the larger set of experiments performed.



Time (seconds)

          $m_\ell = 1$   $m_\ell = 2$   $m_\ell = \ell$   $m_\ell = 2^\ell$

q = 1        27.59          15.91           4.73              7.73
q = 4         8.08           5.11           2.35              4.11
q = 9         4.93           3.43           2.27              1.86

Fig. 1. Non-stationary parallel Newton Gauss-Seidel methods
Figure 1 shows the behavior of some non-stationary parallel Newton iterative
methods on an IBM RS/6000 SP multiprocessor using four processors and
$n_k = 1024$, $1 \le k \le 4$. This figure illustrates the influence of the
non-stationary parameters $q(\ell,s,k) = q$, $1 \le k \le 4$, in relation to
$m_\ell = 1, 2, \ell, 2^\ell$. We want to note that, for a fixed number of
processors, the computational time starts to decrease as the non-stationary
parameters increase until some optimal value of $q$ ($q = 9$ in Figure 1),
after which time starts to increase. This behavior is typical of
non-stationary methods; see e.g. [6] and [9]. In general, this optimal value
is hard to predict, but if the decrease in the number of iterations
compensates for the cost of performing more local updates, then less
execution time is observed. This situation is independent of the choice of
$m_\ell$. On the other hand, in this figure it can also be observed that the
best non-stationary parallel methods were obtained setting $m_\ell = \ell$ and
$m_\ell = 2^\ell$.



Time (seconds)

                     Cluster, $m_\ell = \ell$    IBM SP, $m_\ell = \ell$

q = 1                      28.33                       5.36
q = [illegible]            18.99                       3.93
q = [illegible]            14.01                       3.01
q = 16                     14.54                       3.16

Fig. 2. Cluster of Pentiums versus IBM RS/6000 SP (2 processors)



Figure 2 shows the behavior of some non-stationary parallel Newton iterative
methods in relation to the parallel computer system used. In this figure we
have used two processors, $n_k = 2048$, $k = 1, 2$, and $m_\ell = \ell$. The
conclusions were similar on both multiprocessors; however, the computing
platform obviously has an influence on the performance of a parallel
implementation. Note that when $q = 1$, the method reduces to the well-known
parallel Newton Gauss-Seidel method (see [14]) and, as can be seen, this
method is always worse than the non-stationary parallel methods. Moreover, we
have compared these methods with the algorithms presented in [1]. We have
observed that the methods discussed here behave better than those algorithms.
For example, for the matrix of size 4096, the best time we have obtained with
the IBM RS/6000 SP using four processors (see Figure 1) is 1.86 seconds,
whereas the best times obtained with the other methods (see Tables 1 and 2
of [1]) were about 6 seconds.




Fig. 3. Non-stationary methods {q = 9) and sequential Newton Gauss-Seidel 
method 



On the other hand, in Figure 3 we have compared the algorithms of this
paper, setting $q = 9$, with the well-known sequential Newton Gauss-Seidel
(N-GS) method [11] versus the number of processors in the IBM RS/6000 SP. The
best CPU time achieved by this sequential method was obtained with
$m_\ell = \ell$. So, if we calculate the speed-up taking this sequential
method as the reference algorithm,

\[ \text{Speed-up} = \frac{\text{CPU time of sequential Newton Gauss-Seidel algorithm } (m_\ell = \ell)}{\text{REAL time of parallel algorithm}}, \]

an efficiency

\[ \text{Efficiency} = \frac{\text{Speed-up}}{\text{number of processors}} \]

of about 90% and 60% can be obtained with 2 and 4 processors, respectively.
Similar efficiencies were obtained for the cluster of Pentiums. However, the
same does not happen with the parallel Newton Gauss-Seidel method ([14]).
That is, if $q = 1$, we have obtained efficiencies of only about 0-30% on
both multiprocessors.

Finally, Figure 4 illustrates the influence of the relaxation parameter
$\omega$ when non-stationary parallel Newton-SOR methods are used. In
Subfigure 4(a) we have considered, for the system of size n = 4096, some
non-stationary parallel Newton-SOR methods using four processors,
$n_k = 1024$, $1 \le k \le 4$, and $m_\ell = \ell$, and for each one we
recorded the REAL time in seconds on the IBM RS/6000 SP. Moreover, these
results were compared to the corresponding parallel Newton-SOR method ([14]).
As can be seen, the conclusions were similar to those described throughout
this section.

Fig. 4. Non-stationary Newton-SOR methods: (a) n = 4096, (b) n = 11664.

As mentioned before, the results shown in this paper correspond to the system
of size n = 4096 using the stopping criterion (1)
$\|x^{(\ell)} - v\|_2 < h^2$. We have obtained an identical performance for
all the systems tested. Moreover, the conclusions were independent of the
stopping criterion used. Subfigure 4(b) illustrates these facts. In this
figure we show the results obtained for the system of size n = 11664
(d = 108) using the stopping criterion (2)
$\|[\ldots]\|_1 < 10^{[\ldots]}$. A stopping criterion of type (2) is a
possible stopping criterion when the exact solution is not known. The results
of Subfigure 4(b) have been obtained using 4 processors, $n_k = 2916$,
$1 \le k \le 4$, and $m_\ell = \ell$.
Abstract. We present a parallelization of Petkov, Christov, and
Konstantinov's algorithm for the pole assignment problem of single-input
systems. Our new implementation is especially appropriate for current high
performance processors and shared memory multiprocessors and obtains
a high performance by reordering the access pattern, while maintaining
the same numerical properties.

The experimental results on two different platforms (SGI PowerChallenge
and SUN Enterprise) report a higher performance of the new
implementation over traditional algorithms.



1 Introduction 

Consider the continuous, time-invariant linear system defined by

\[ \dot{x}(t) = A x(t) + B u(t), \quad x(0) = x_0, \]

with n states, in vector $x(t)$, and m inputs, in vector $u(t)$. Here, $A$ is
the $n \times n$ state matrix, and $B$ is the $n \times m$ input matrix.

In the design of linear control systems, $u(t)$ is used to control the
behaviour of the system. Specifically, the control

\[ u(t) = -F x(t), \]

where $F$ is an $m \times n$ feedback matrix, is used to modify the properties
of the closed-loop system

\[ \dot{x}(t) = (A - BF) x(t). \]

* Supported by the Conselleria de Cultura, Educación y Ciencia de la
Generalidad Valenciana GV99-59-1-14 and the Fundació Caixa-Castelló Bancaixa.
The problem of finding an appropriate feedback $F$ is referred to as the
problem of synthesis of a state regulator [11]. In some applications, e.g.,
for asymptotic stability [4,11], $F$ can be chosen so that the eigenvalues of
the closed-loop matrix are in the open left-half complex plane.

In this paper we are interested in the pole assignment problem of
single-input systems ($m = 1$ and $B = b$ is a vector), or PAPSIS, which
consists in the determination of a feedback vector $F = f$, such that the
poles of the closed-loop system are allocated to a pre-specified set
$\Lambda = \{\lambda_1, \lambda_2, \ldots, \lambda_n\}$ [4]. This problem has
a solution (unique in the single-input case) if and only if the system
is controllable [15]. We assume hereafter that this condition is satisfied.
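
A two-state example may help fix ideas. For the hypothetical system below (a
double integrator, chosen here only for illustration and not taken from the
paper), the closed-loop matrix $A - bf$ is in companion form, so the feedback
that places the poles at $-1$ and $-2$ can be read off from the target
characteristic polynomial $s^2 + 3s + 2$. The sketch checks the result
through the trace and determinant of the closed-loop matrix.

```python
def closed_loop(A, b, f):
    """Return A - b f for a single-input system (b column, f row)."""
    n = len(A)
    return [[A[i][j] - b[i] * f[j] for j in range(n)] for i in range(n)]


def char_poly_2x2(M):
    """Coefficients (c1, c0) of det(sI - M) = s^2 + c1 s + c0."""
    trace = M[0][0] + M[1][1]
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return (-trace, det)


# Hypothetical controllable system: double integrator with one input.
A = [[0.0, 1.0], [0.0, 0.0]]
b = [0.0, 1.0]
poles = (-1.0, -2.0)

# Target polynomial (s - p1)(s - p2) = s^2 - (p1 + p2) s + p1 p2.
target = (-(poles[0] + poles[1]), poles[0] * poles[1])

# Here A - b f = [[0, 1], [-f1, -f2]], whose polynomial is s^2 + f2 s + f1.
f = [target[1], target[0]]

achieved = char_poly_2x2(closed_loop(A, b, f))
print(f, achieved)
```

For larger systems such a direct coefficient match is numerically
unattractive, which is one motivation for the orthogonal-transformation
methods discussed next.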

A survey of existing algorithms for the pole assignment problem can be found, 
e.g., in [4,5,6,11,14]. Among these, methods based on the Schur form of the 
closed-loop state matrix [6,9,10] are numerically stable [3,7]. 

In [2] we apply block-partitioned techniques to obtain efficient implementa- 
tions of Miminis and Paige’s algorithm for PAPSIS [6]. In this paper we apply 
similar techniques to obtain LAPACK-like [1] block-partitioned variants and par- 
allel implementations of Petkov, Christov, and Konstantinov’s algorithm (here- 
after, PCK) [10] for PAPSIS. 

We assume the system to be initially in unreduced controller Hessenberg 
form [13]. This reduction can be carried out by means of efficient blocked algo- 
rithms based on (rank-revealing) orthogonal factorizations [12]. 

Our algorithms are specially designed to provide a better use of the cache 
memory, while maintaining the same numerical properties. The experimental 
results on SGI PowerChallenge and SUN Enterprise multiprocessors report the 
performance of our block-partitioned serial and parallel algorithms. 



2 The Sequential PCK Algorithm 

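Consider the controllable single-input system in controller Hessenberg form
defined by $(A, b)$, with real entries,

\[
(b \mid A) = \left( \begin{array}{c|cccc}
\beta_1 & \alpha_{11} & \cdots & \alpha_{1,n-1} & \alpha_{1n} \\
        & \alpha_{21} & \cdots & \alpha_{2,n-1} & \alpha_{2n} \\
        &             & \ddots & \vdots         & \vdots \\
        &             &        & \alpha_{n,n-1} & \alpha_{nn}
\end{array} \right). \tag{1}
\]

As the system is controllable, it can be shown that
$\beta_1, \alpha_{21}, \ldots, \alpha_{n,n-1} \ne 0$ [13].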

The PCK algorithm is based on orthogonal transformations of the eigenvectors
and proceeds as follows. (For simplicity we only describe the algorithm for
pole assignment of real eigenvalues.) Let $\lambda \in \mathbb{R}$ and
$v \in \mathbb{R}^n$ be, respectively, an eigenvalue and its corresponding
eigenvector of the closed-loop matrix $A - bf$. Let $Q$ be an orthogonal
matrix such that $Q^T v = (\tilde{v}_1, 0, \ldots, 0)^T$. This matrix can be
constructed so that $Q^T A Q$ and $Q^T (A - bf) Q$ are in Hessenberg form.
Furthermore,

\[ Q^T (A - bf) Q e_1 = (\lambda, 0, \ldots, 0)^T, \tag{2} \]

where $e_1$ is the first column of the identity matrix, and solving (2) we
find the first element of the transformed feedback $\tilde{f} = fQ$ from the
corresponding elements of $Q^T A Q$ and $Q^T b$. After this stage, the
procedure is repeated with the lower trailing blocks of order $n - 1$ of the
transformed matrices to assign a new pole. By proceeding recursively we obtain
$\tilde{f}$, and $f = \tilde{f} Q^T$. The procedure for assigning
$\Lambda = \{\lambda_1, \lambda_2, \ldots, \lambda_n\}$ can be roughly stated
as follows.

for i = 1, ..., n - 1
    Set $v_n = 1$ and compute $v_{n-1}$
    for j = n - 1, n - 2, ..., i
        Compute $v_{j-1}$
        Construct a Givens rotation $R_{j,j+1} \in \mathbb{R}^{n \times n}$ such that
            $(v_1, \ldots, v_j, v_{j+1}, 0, \ldots, 0) R_{j,j+1} = (v_1, \ldots, \tilde{v}_j, 0, \ldots, 0)$
        Apply the transformation $A = R_{j,j+1}^T A R_{j,j+1}$
    end for
    Apply the transformation $b = R_{i,i+1}^T b$
    Compute $f_i = a_{i+1,i} / b_{i+1}$
end for
Compute $f_n = (a_{n,n} - \lambda_n) / b_n$

At each iteration of the outer loop a new pole is assigned. At each iteration
of the inner loop we compute one component of the eigenvector $v$ (component
$j - 1$), construct a transformation that introduces a zero in another
component of the eigenvector (component $j + 1$), and finally apply this
transformation to the system matrix.
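
The elementary step of the inner loop, zeroing one component of the eigenvector
with a plane rotation, can be sketched as follows. This is only an illustrative
sketch, not the authors' Fortran-77 code: it assumes the eigenvector is handled
as a row vector with 0-based indices, and the function name
`givens_zero_right` is hypothetical.

```python
import math

def givens_zero_right(v, j):
    """Rotate components j and j+1 of the row vector v (0-based) so that
    component j+1 becomes zero. Returns the rotation parameters (c, s)
    and the rotated copy of v."""
    a, b = v[j], v[j + 1]
    d = math.hypot(a, b)
    if d == 0.0:
        return (1.0, 0.0), list(v)      # nothing to annihilate
    c, s = a / d, b / d
    out = list(v)
    # (a, b) * [[c, -s], [s, c]] = (d, 0)
    out[j] = c * a + s * b
    out[j + 1] = -s * a + c * b
    return (c, s), out

# Sweeping j downwards concentrates the whole vector in its first component.
v = [1.0, 2.0, 2.0]
for j in (1, 0):
    _, v = givens_zero_right(v, j)
print(v)
```

In the algorithm the same pair $(c, s)$ would then be applied to the
corresponding rows and columns of the system matrix.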



3 Parallelization of the PCK Algorithm 

In traditional implementations of this algorithm each transformation matrix
$R_{j,j+1}$ is applied immediately after it is computed. Thus, at each
iteration of loop $j$, two rows and two columns (the $j$-th and the
$(j+1)$-th) of the matrix are referenced.

Our block-partitioned algorithms reduce the number of data references by
delaying the update of some entries of the matrix. We work on the transformed
lower Hessenberg matrix, partition it by blocks of columns (see figure 1), and
delay the application of transformations from the left until the proper block
is referenced. Although the parameters of the delayed transformations need to
be stored, this work space is small.

Specifically, consider the assignment of the first pole in the block-partitioned 
algorithm: 

- A set of transformations is computed to shift up the pole until it
  disappears at the top left corner of block B1, and the transformations are
  applied only to B1. The application of this update from the left to blocks
  B2, . . . , B6 is delayed.

- The procedure continues with block B2. First, the delayed update is applied
  from the left to B2. Then, a new set of transformations is computed to shift
  up the pole, and these transformations are applied only to B2. (The
  application of this update from the left to blocks B3, . . . , B6 is
  delayed.)

- The procedure is repeated with blocks B3, B4, B5, and B6, until the pole is
  assigned and the problem is deflated.

Fig. 1. Partition of the matrix by blocks of columns (B1, . . . , B6).
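
The idea of delaying left updates can be illustrated with a small sketch. It
makes a simplifying assumption that does not hold in the real algorithm: the
rotations are known in advance, whereas in the block-partitioned algorithm each
set is computed only after the delayed update has reached the current block.
The helper names are hypothetical. The sketch only shows that replaying stored
rotation parameters block by block gives the same matrix as applying each
rotation to all columns immediately.

```python
def apply_left(A, rot, cols):
    """Apply the plane rotation rot = (p, c, s) to rows p and p+1 of A,
    restricted to the given columns."""
    p, c, s = rot
    for k in cols:
        x, y = A[p][k], A[p + 1][k]
        A[p][k] = c * x + s * y
        A[p + 1][k] = -s * x + c * y

def update_eager(A, rots):
    """Non-blocked variant: every rotation touches all columns at once."""
    ncols = len(A[0])
    for rot in rots:
        apply_left(A, rot, range(ncols))
    return A

def update_delayed(A, rots, nb):
    """Blocked variant: stored rotations are replayed on one block of nb
    columns at a time, so each block is traversed while it is 'hot'."""
    ncols = len(A[0])
    for start in range(0, ncols, nb):
        block = range(start, min(start + nb, ncols))
        for rot in rots:
            apply_left(A, rot, block)
    return A

A0 = [[float(r * 6 + c + 1) for c in range(6)] for r in range(4)]
rots = [(2, 0.6, 0.8), (1, 0.8, 0.6), (0, 0.6, -0.8)]
same = update_eager([row[:] for row in A0], rots) == \
       update_delayed([row[:] for row in A0], rots, 2)
print(same)
```

Only the list of `(p, c, s)` triples has to be kept between blocks, which is the
small work space mentioned above.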



In the parallel algorithm we seek a higher (and coarser) degree of parallelism
than that achieved by applying a single transformation. Notice that in each
iteration of the inner loop $j$ two rows and two columns of the matrix are
modified. Thus, as soon as $j = n - 4$, it would be possible to start the
assignment of a different pole.

This is a pipelined algorithm. Specifically, the assignment of a new pole can
be started as soon as the transformations related to the previous pole no
longer affect the last block of columns. Thus, it is possible to assign in
parallel as many poles as there are blocks in the partition of the matrix.

In our algorithm, the maximum number of pipelined stages equals the number of
blocks of columns, that is, about $n/nb$, where $n$ and $nb$ are the problem
size and the block size, respectively. Figure 2 shows the evolution of the
different stages in our pipelined algorithm. As the problem is deflated, the
number of blocks of columns (and therefore the number of pipelined stages)
decreases. In practice, $nb$ must be larger than three; otherwise, the stages
cannot be correctly pipelined.
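
The overlap of stages can be sketched as a simple schedule. The sketch assumes,
from the panel titles of figure 2, that each new stage starts two steps after
the previous one, and it keeps the number of steps per stage fixed, which
ignores the shrinking caused by deflation. The function name and parameters
are hypothetical.

```python
def pipeline_schedule(num_stages, steps_per_stage, lag=2):
    """Return, for each global time step, the (stage, step) pairs that are
    active, assuming stage k starts `lag` steps after stage k-1."""
    total = (num_stages - 1) * lag + steps_per_stage
    timeline = []
    for t in range(1, total + 1):
        active = []
        for stage in range(1, num_stages + 1):
            step = t - (stage - 1) * lag
            if 1 <= step <= steps_per_stage:
                active.append((stage, step))
        timeline.append(active)
    return timeline

for t, active in enumerate(pipeline_schedule(3, 6), start=1):
    print(t, active)
```

With three stages of six steps, the fifth time step lists stage 1 at step 5,
stage 2 at step 3 and stage 3 at step 1, matching panel (e) of figure 2.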



4 Experimental Results 

In this section we report the results of our numerical experiments on an SGI
PowerChallenge (SGI MIPS R10000) and a SUN Enterprise 4000 (SUN UltraSPARC)
multiprocessor. All our experiments were performed using IEEE double-precision
arithmetic and Fortran-77 ($\varepsilon \approx 2.2 \times 10^{-16}$). Our
implementations employ orthogonal transformations based on Givens rotations.
The system pair $(A, b)$ was generated so that the computation of the feedback
matrix was well-conditioned.

Fig. 2. Evolution of the pipelined algorithm. Panels: (a) stage 1, step 1;
(b) stage 1, step 2; (c) stage 1, step 3 and stage 2, step 1; (d) stage 1,
step 4 and stage 2, step 2; (e) stage 1, step 5, stage 2, step 3 and stage 3,
step 1; (f) stage 1, step 6, stage 2, step 4 and stage 3, step 2.

We have developed the following pole-assignment algorithms: 

BPAPSIS: Block-partitioned algorithm. 

PPAPSIS: Parallel version of the block-partitioned algorithm. 

Figure 3 shows the speed-up of our block-partitioned algorithm for different
block dimensions $nb$ and problem sizes $n$. We tested systems of moderate
size, from $n = 100$ to $n = 1000$, using block sizes $nb = 1$ (non-blocked
algorithm), 32, 64 and 100 on the SGI MIPS R10000 processor, and $nb = 1$, 16,
32 and 64 on the SUN UltraSPARC processor. The results are averaged over 5
executions on different random matrices. In all the experiments the blocked
implementations clearly outperform the non-blocked sequential code
($nb = 1$), except on the SGI MIPS R10000 for small problems ($n < 200$).



Fig. 3. Speed-up of the block-partitioned algorithm on the SGI MIPS R10000
(left) and the SUN UltraSPARC (right) processors.



Figure 4 shows the efficiency of our parallel algorithm compared with the
non-blocked and blocked algorithms using $np = 2, 4, \ldots, 12$ processors.
The plots report efficiency versus problem size on the SGI PowerChallenge and
SUN Enterprise platforms. The blocked and parallel algorithms employ the
optimal block size determined in the previous experiment, i.e., $nb = 100$ and
$nb = 32$ for SGI and SUN, respectively. As these plots show, when the parallel
algorithm is compared with the serial non-blocked algorithm, efficiencies
higher than 1 are obtained. When it is compared with the blocked algorithm,
however, the maximum efficiency is 80%, and it decreases as the number of
processors increases, since the problem size is moderate.



Fig. 4. Efficiency of the parallel algorithm versus problem size ($n$), with
one curve per number of processors $np$, compared with the non-blocked
algorithm (top) and the blocked algorithm (bottom) on the SGI PowerChallenge
(left) and the SUN Enterprise 4000 (right) multiprocessors.



5 Conclusions 

We have presented block-partitioned and parallel versions of Petkov, Christov,
and Konstantinov’s algorithm for the pole assignment problem of single-input
systems. Our block-partitioned algorithms achieve a high speed-up on SGI and
SUN processors, while maintaining the same numerical properties.

The experimental results of the parallel algorithms also show an important
increase in performance on the SGI PowerChallenge and SUN Enterprise
platforms.



Abstract. The study of the solution of the Generalized Sylvester Equation
and other related equations is a good example of the role played by
matrix arithmetic in the field of Modern Control Theory. We describe the
work performed to develop systolic algorithms for solving this equation
in a fast and effective way. The results presented show that the design
methodology used allows us to propose Systolic Libraries, that is,
reusable systolic arrays that can be implemented using FPGA technology.
In this paper we show that it is feasible to solve the Generalized
Sylvester Equation using basic Linear Algebra modules that can be
implemented on versatile systolic arrays.



1 Introduction 

The Generalized Sylvester Equation, $AXB + CXD = E$, with
$A, C \in \mathbb{R}^{m \times m}$, $B, D \in \mathbb{R}^{n \times n}$ and
$X, E \in \mathbb{R}^{m \times n}$, and some simpler derived equations such as
the Sylvester [7], [15], [3], Lyapunov [13], [17] and Stein [7], [15] equations
have multiple and important applications in the field of Control Theory [9],
[7], [15].

Obtaining the solution of these equations is a suitable problem for the ef- 
ficient use of parallel algorithms, due to the regular structure of the matrices. 
However, when real-time constraints apply to the system, the use of dedicated 
processors, usually implementing systolic algorithms in VLSI is required. We 
have recently presented several works [10], [12] showing that a modular approach 
to systolic algorithms is a suitable way of building fast, reconfigurable solutions 
to be implemented in FPGA devices to obtain cost-effective custom processors 
to solve different problems. 

The starting point is a new design methodology [10] based on the Kronecker
Product and Vec-Function operators. Algorithms obtained this way are easy to
parallelize because they consist of combinations of basic, widely studied
operations (solving a triangular equation system, Gaxpy, Saxpy, QR
decomposition of a Hessenberg matrix, . . . ), and the required data flow is
well structured to pass from one functional block to another without
intermediate storage.

Extending these results, we have compiled all the basic modules in a Systolic
Library for Linear Algebra, following the same principle of modular
programming that generated other sequential and parallel environments [1],
[18]. For the modules of this library [11] to be useful to solve any problem in
their application field, two restrictions hold: (1) all the systolic arrays
must share a compatible data flow, to allow results from one of them to be
forwarded to another, and (2) the arrays must be designed to process problems
of any size. These two restrictions have been satisfied using dynamic arrays
and applying the DBT transformation [14] to the basic operations of linear
algebra.

The application described in this paper is a good example of the use of
the Systolic Library. The first step to solve the Generalized Sylvester
Equation, following the method proposed by Golub, Nash and Van Loan [4], is
transforming the original problem $A'X'B' + C'X'D' = E'$ into
$AXB + CXD = E$ using orthogonal similarity transformations on the pencils
$A' - \lambda C'$ and $D' - \lambda B'$ to obtain their Generalized Schur Forms
(that is, $P_1 (A - \lambda C) P_2 = A' - \lambda C'$ and
$Q_1 (D - \lambda B) Q_2 = D' - \lambda B'$). The coefficient matrices of the
resulting equation are in a condensed form. We have worked on the solution for
three cases [10]: first, when all of them are triangular (Triangular Case);
second, when $A$ is Schur or Hessenberg and the others are triangular
(Hessenberg Case); third, when both matrices $A$ and $D$ are Schur (General
Case). The study of the first two cases has made possible the development of
the basic arrays; the study of the general case allowed us to show that the
collection of routines obtained is efficient (and sufficient) to solve more
general and complex problems.

Section 2 presents the basis of the methodology for developing the algorithms: 
the definition of Kronecker Product and Vector Function of a matrix. Section 
3 describes the main operations to be solved when studying the solution of the 
Generalized Sylvester Equation in the General Case. Then section 4 shows how 
to use the library to implement this operation. Finally section 5 concludes and 
presents the ongoing work. 



2 Applying the Methodology of Design 

The methodology used to solve the Generalized Sylvester Equation, described
in [10], is based on the definition of the Kronecker Product and Vec-Function
of a matrix. The properties of both operators [6] can be applied to simplify
the structure of the problem. Concretely, by applying them to the equation
$AXB + CXD = E$, the linear equation system
$(B^T \otimes A + D^T \otimes C)\,\mathrm{vec}(X) = \mathrm{vec}(E)$, shown in
figure 1, is obtained^1. The resulting system, too large to be implemented
directly, offers a clear representation of the data dependencies and a simple
expression of the basic steps required to solve the problem.

The structure, similar to an upper triangular system, suggests the application
of the Back Substitution Algorithm to solve the problem. For example, an
intuitive and simple method would be to obtain the value of $x_n$ and then
update the values of $e_{n-1}, \ldots, e_1$, as is done for the solution of a
triangular system. The resulting procedure is shown in figure 2.

^1 Assuming that the pencil $D - \lambda B$ has lower quasi-triangular
structure; this affects only the order of resolution and helps to visualize
the problem.

Fig. 1. Linear Equation System obtained by applying the Kronecker Product
and the Vec-function to the Triangular Generalized Sylvester Equation.
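
The identity behind this reformulation,
$\mathrm{vec}(AXB + CXD) = (B^T \otimes A + D^T \otimes C)\,\mathrm{vec}(X)$,
can be checked numerically on a tiny instance. The following sketch uses plain
Python lists and integer data chosen only for illustration; the helper names
are hypothetical, and `vec` stacks the columns of its argument.

```python
def matmul(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

def transpose(P):
    return [list(col) for col in zip(*P)]

def kron(P, Q):
    """Kronecker product: block (i, j) of the result is P[i][j] * Q."""
    return [[p * q for p in prow for q in qrow] for prow in P for qrow in Q]

def vec(X):
    """Stack the columns of X into one long vector."""
    return [X[r][c] for c in range(len(X[0])) for r in range(len(X))]

def gse_operator(A, B, C, D):
    """Coefficient matrix B^T (x) A + D^T (x) C of the equivalent system."""
    K1 = kron(transpose(B), A)
    K2 = kron(transpose(D), C)
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(K1, K2)]

A = [[1, 2], [0, 3]]
C = [[2, 1], [0, 1]]
B = [[1, 0, 0], [2, 1, 0], [0, 1, 2]]     # lower triangular
D = [[1, 0, 0], [1, 2, 0], [3, 0, 1]]     # lower triangular
X = [[1, -1, 2], [0, 4, 3]]

K = gse_operator(A, B, C, D)
lhs = [sum(k * x for k, x in zip(row, vec(X))) for row in K]
AXB, CXD = matmul(matmul(A, X), B), matmul(matmul(C, X), D)
E = [[p + q for p, q in zip(r1, r2)] for r1, r2 in zip(AXB, CXD)]
print(lhs == vec(E))
```

The coefficient matrix has order $mn$ (6 here), which is why the system is used
only to expose the data dependencies and is never formed in practice.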



Calculate Q: $(A b_{ii} + C d_{ii}) Q$ is upper triangular;
Solve $((A b_{ii} + C d_{ii}) Q)(Q^T x_i) = e_i$;
w := (AQ) * ($Q^T x_i$);
v := (CQ) * ($Q^T x_i$);
$x_i$ := Q * ($Q^T x_i$);
for j := i-1 downto 1 do
    Update $e_j := e_j - w\,b_{ij} - v\,d_{ij}$
endfor;



Fig. 2. SGH Step: Procedure to obtain $x_i$, assuming that $d_{i-1,i} = 0$.



But figure 1 also shows that for certain elements (for example $x_3$) that
simple procedure cannot be applied, because there are subdiagonal elements of
matrix $D$ ($d_{23}$) that produce subdiagonal blocks in the transformed
matrix. It is then necessary to solve two columns of matrix $X$ at once
($x_3$ and $x_2$). We will call this new operation Solve_2. Figure 3 shows the
complete procedure to solve the equation.

In the resulting SGG Algorithm all the operations but Solve_2 are basic
operations of Linear Algebra, and they can be directly performed on the arrays
designed in the systolic library described in [11]. In fact, the SGH step is
the basic stage of the algorithm for solving the Hessenberg case [10].
Therefore, to continue with the study of the solution of the General case it
is necessary to study this new operation.

i := n;
while (i > 0) do
    if ($d_{i-1,i} = 0$) then
        SGH step;
        i := i-1
    else
        Solve_2 (obtain $x_{i-1}$ and $x_i$);
        w1 := A * $x_i$;
        v1 := C * $x_i$;
        w2 := A * $x_{i-1}$;
        v2 := C * $x_{i-1}$;
        for j := i-1 downto 1 do
            Update $e_j := e_j - (w1\,b_{ij} + v1\,d_{ij} + w2\,b_{i-1,j} + v2\,d_{i-1,j})$
        endfor;
        i := i-2
    endif
endwhile;



Fig. 3. The resulting SGG Algorithm.



3 The SOLVE_2 Operation 

For the efficient implementation of the Solve_2 operation we start by analyzing
the structure of its coefficient matrix,

$$
\mathcal{M} = \begin{pmatrix}
A b_{i-1,i-1} + C d_{i-1,i-1} & A b_{i,i-1} + C d_{i,i-1} \\
C d_{i-1,i} & A b_{ii} + C d_{ii}
\end{pmatrix}.
$$

A possible example, assuming $m = 4$ and writing $a$, $b$, $c$ and $d$ for the
entries of the four $m \times m$ blocks of $\mathcal{M}$, would be the
following:

$$
\mathcal{M} = \begin{pmatrix}
a_{11} & a_{12} & a_{13} & a_{14} & b_{11} & b_{12} & b_{13} & b_{14} \\
0 & a_{22} & a_{23} & a_{24} & 0 & b_{22} & b_{23} & b_{24} \\
0 & 0 & a_{33} & a_{34} & 0 & 0 & b_{33} & b_{34} \\
0 & 0 & a_{43} & a_{44} & 0 & 0 & b_{43} & b_{44} \\
c_{11} & c_{12} & c_{13} & c_{14} & d_{11} & d_{12} & d_{13} & d_{14} \\
0 & c_{22} & c_{23} & c_{24} & 0 & d_{22} & d_{23} & d_{24} \\
0 & 0 & c_{33} & c_{34} & 0 & 0 & d_{33} & d_{34} \\
0 & 0 & 0 & c_{44} & 0 & 0 & d_{43} & d_{44}
\end{pmatrix} \qquad (1)
$$

We have followed the proposal of Golub, Nash and Van Loan [4] to reduce
the cost of triangularizing this matrix ($O(m^3)$ flops^2). Applying to the
problem a permutation matrix $P$ that transforms $1, 2, \ldots, mn$ into
$1, n+1, 2n+1, \ldots, (m-1)n+1, 2, n+2, 2n+2, \ldots, (m-1)n+2, \ldots, n, 2n,
3n, \ldots, (m-1)n, mn$, the result is an equivalent problem in which the
coefficient matrix is an upper triangular matrix with two non-zero
subdiagonals. Using that transformation in the example, the result is

$$
P^T \mathcal{M} P = \begin{pmatrix}
a_{11} & b_{11} & a_{12} & b_{12} & a_{13} & b_{13} & a_{14} & b_{14} \\
c_{11} & d_{11} & c_{12} & d_{12} & c_{13} & d_{13} & c_{14} & d_{14} \\
0 & 0 & a_{22} & b_{22} & a_{23} & b_{23} & a_{24} & b_{24} \\
0 & 0 & c_{22} & d_{22} & c_{23} & d_{23} & c_{24} & d_{24} \\
0 & 0 & 0 & 0 & a_{33} & b_{33} & a_{34} & b_{34} \\
0 & 0 & 0 & 0 & c_{33} & d_{33} & c_{34} & d_{34} \\
0 & 0 & 0 & 0 & a_{43} & b_{43} & a_{44} & b_{44} \\
0 & 0 & 0 & 0 & 0 & d_{43} & c_{44} & d_{44}
\end{pmatrix} \qquad (2)
$$

^2 According to the old definition of flops [5], $a[i] = a[i] + b[i] * c[i]$,
to better compare the sequential algorithm with the systolic implementation.



Different possibilities were considered when designing the corresponding
algorithm to avoid the construction of the auxiliary matrix
$P^T \mathcal{M} P$. Two were studied in depth due to their feasibility:

1. To process $\mathcal{M}$ as matrix $(A b_{ii} + C d_{ii})$ in the SGH step.
   The basic idea in the procedure described in figure 2 is to look for a
   compatible data flow among the operations to allow a systolic
   implementation. Then the transformation to triangularize the coefficient
   matrix of Solve is applied by columns. In the systolic implementation the
   resulting data flow allows a good chaining between the Calculate Q and
   Solve operations, and also between Solve and Gaxpy; moreover, there is no
   need to form an auxiliary matrix, since the work is done in terms of the
   original one. Our aim was also to keep the original matrices in the
   Solve_2 operation, following for the triangularization the reduction order
   imposed by the permutation of $\mathcal{M}$ in eq. 2. The result was the
   design of a sequential algorithm, SGG1 [10], of $O(5m^2 n + m n^2)$ flops.

2. To process $\mathcal{M}$ in a similar way to the Back Substitution
   Algorithm, obtaining the values of columns $x_i$ and $x_{i-1}$ by groups of
   two elements (corresponding to zero subdiagonal elements of matrix $A$) or
   four elements (corresponding to non-zero subdiagonal entries of matrix
   $A$). This is required by the structure of $\mathcal{M}$ in eq. 1. For
   non-zero entries of the original matrix $A$ (for example elements $a_{43}$,
   $b_{43}$ and $d_{43}$) a $4 \times 4$ system has to be solved, obtaining
   four values of the $i$-th and $(i-1)$-th columns of $X$. For entries whose
   value is zero, solving a $2 \times 2$ system such as

   $$
   \begin{pmatrix} a_{22} & b_{22} \\ c_{22} & d_{22} \end{pmatrix}
   \begin{pmatrix} x_{2,i-1} \\ x_{2,i} \end{pmatrix} =
   \begin{pmatrix} e_{2,i-1} \\ e_{2,i} \end{pmatrix}
   $$

   yields two values of the $i$-th and $(i-1)$-th columns of $X$. The
   corresponding sequential algorithm [10] has a temporal cost of
   $O(m^2 n + m n^2)$ flops.
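
The effect of the permutation used to obtain eq. 2 can be checked on the
non-zero pattern of the $m = 4$ example. The sketch below is an illustration
under an explicit assumption: the permutation is taken to be the interleaving
order $1, m+1, 2, m+2, \ldots$ of the $2m$ rows and columns, which is the
ordering that produces the two-subdiagonal structure described in the text.
The helper names are hypothetical.

```python
def interleave_order(m):
    """0-based index order 0, m, 1, m+1, ..., m-1, 2m-1."""
    order = []
    for k in range(m):
        order += [k, m + k]
    return order

def permute(M, order):
    """Symmetric permutation: entry (r, c) of the result is M[order[r]][order[c]]."""
    return [[M[r][c] for c in order] for r in order]

def lower_bandwidth(M):
    """Largest distance below the diagonal at which a non-zero entry occurs."""
    return max([r - c for r, row in enumerate(M)
                for c, x in enumerate(row) if x != 0 and r > c], default=0)

def example_pattern(m=4):
    """Non-zero pattern (1 = possibly non-zero) of the coefficient matrix of
    Solve_2: three quasi-triangular blocks and one triangular block."""
    quasi = [[1 if (c >= r or (r, c) == (m - 1, m - 2)) else 0
              for c in range(m)] for r in range(m)]
    tri = [[1 if c >= r else 0 for c in range(m)] for r in range(m)]
    top = [quasi[r] + quasi[r] for r in range(m)]
    bottom = [tri[r] + quasi[r] for r in range(m)]
    return top + bottom

M = example_pattern(4)
print(lower_bandwidth(M), lower_bandwidth(permute(M, interleave_order(4))))
```

Before the permutation the lowest non-zero diagonal lies $m$ positions below
the main diagonal; afterwards only two subdiagonals remain, so the matrix can
be triangularized with rotations on adjacent columns.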



3.1 Obtaining Systolic Algorithms for the Solve_2 Operation 

The previous resolution schemes present two major drawbacks for their systolic
implementation:

1. For the first approach, the rotations involve columns in different blocks of
   the original matrices (marked in bold in eq. 2); therefore it is necessary
   to explicitly form all the linear combinations of all the blocks involved
   in the Update of other columns of matrix $E$. It is impossible to form the
   auxiliary vectors w1, w2, v1 and v2 to reduce the cost of Update.

2. For the second approach, data dependencies are so strong that we could not
   find an efficient systolic algorithm for it.

Therefore, to design an efficient systolic algorithm for the Solve_2 operation,
we studied the reuse of those obtained for simpler cases. When solving the
Triangular and the Hessenberg case, two basic systolic arrays were designed
[11]. The first one, called Module QR, has the capability of performing the
operation

    Calculate Q : $(\alpha A + \beta B) Q$ is upper triangular

obtaining $AQ$, $BQ$ and $Q$, and working with matrices of any size. The
description is presented in figure 4. If $\alpha = 1$, $A = A$, $\beta = 0$ and
$B = I$, the outputs of this operation are $AQ$ and $Q$.






CALCULATE_R (inputs N1, N2, N3, N4, E1, E2; outputs S1, S2, O1, O2, O3, O4,
E3, E4):

    if Control then
        if N4 = 0 then (R0)
            E3 := 1; E4 := 0
        else
            d := sqrt(sqr(N4*N2) + sqr(E1*N1 + E2*N2));
            E3 := (E1*N1 + E2*N2)/d;
            E4 := (N4*N2)/d
        endif;
        O3 := N1; O4 := N2;
        S1 := N3*E4 + E1*E3;
        S2 := N4*E4 + E2*E3;
        O1 := N3*E3 - E1*E4;
        O2 := N4*E3 - E2*E4
    else {Control = 0}
        E3 := N1; E4 := N2;
        O1 := N3*N1 - E1*N2;
        O2 := N4*N1 - E2*N2;
        S1 := N3*N2 + E1*N1;
        S2 := N4*N2 + E2*N1
    endif

APPLY_R:

    O1 := N1*O3 - E1*O4;
    O2 := N2*O3 - E2*O4;
    S1 := N1*O4 + E1*O3;
    S2 := N2*O4 + E2*O3;
    E3 := O3; E4 := O4



Fig. 4. Module QR.



The second one, called Module Solve/GAXPY, has the capability of
simultaneously performing the operations

    Solve $(\alpha A + \beta B) x = e$ and w := A * x, v := B * x

also working with matrices of any size. The description is presented in
figure 5. If $\alpha = 1$, $A = AQ$, $\beta = 0$ and $B = Q$, among the outputs
of this operation we have $v = x$, obtained from $x := Q(Q^T x)$.

Fig. 5. Module Solve/GAXPY.

It is then possible to solve the General case of the Generalized Sylvester
Equation using the SGH step when a subdiagonal entry of the matrix $D$ is zero
and using the following procedure when a subdiagonal element is non-zero:



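1. Construct the $2m \times 2m$ matrix $P^T \mathcal{M} P$,

2. Construct the corresponding version of the Identity matrix: starting from

   $$
   \mathcal{I} = \begin{pmatrix}
   I_{m \times m} & I_{m \times m} \\
   I_{m \times m} & I_{m \times m}
   \end{pmatrix}
   $$

   apply on it the same permutation (assuming again $m = 4$),

   $$
   P^T \mathcal{I} P = \begin{pmatrix}
   1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
   1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
   0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 \\
   0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 \\
   0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 \\
   0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 \\
   0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 \\
   0 & 0 & 0 & 0 & 0 & 0 & 1 & 1
   \end{pmatrix}, \qquad (3)
   $$

3. Using the Module QR ($\alpha = 1$, $A = P^T \mathcal{M} P$, $\beta = 0$ and
   $B = P^T \mathcal{I} P$) nullify the two subdiagonals of matrix
   $P^T \mathcal{M} P$, obtaining $(P^T \mathcal{M} P) Q$ and
   $(P^T \mathcal{I} P) Q$,

4. Using the Module Solve/GAXPY ($\alpha = 1$, $A = (P^T \mathcal{M} P) Q$,
   $\beta = 0$ and $B = (P^T \mathcal{I} P) Q$) solve the triangular system
   and obtain $x_i$ and $x_{i-1}$ from the solution of the system,

5. Using the Module Solve/GAXPY ($A = A$, $B = C$ and any value for $\alpha$
   and $\beta$) calculate w1, w2, v1 and v2 and Update the matrix $E$.

This procedure can be entirely implemented with the proposed systolic arrays
independently of the size of the coefficient matrices of equation
$AXB + CXD = E$.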
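
The chaining of Calculate Q, Solve and $x := Q(Q^T x)$ that both modules rely
on can be mimicked in software. The sketch below is a sequential illustration
for a single upper Hessenberg coefficient matrix (one non-zero subdiagonal),
not a model of the systolic arrays; it assumes a non-singular matrix, and the
function names are hypothetical.

```python
import math

def triangularize_from_right(M):
    """Return (T, Q) with T = M Q upper triangular, for upper Hessenberg M.
    Q is a product of plane rotations acting on adjacent columns."""
    m = len(M)
    T = [list(map(float, row)) for row in M]
    Q = [[1.0 if r == c else 0.0 for c in range(m)] for r in range(m)]
    for k in range(m - 1, 0, -1):
        a, b = T[k][k - 1], T[k][k]       # rotate so that T[k][k-1] becomes 0
        d = math.hypot(a, b)
        if d == 0.0:
            continue
        c, s = b / d, a / d
        for W in (T, Q):                  # same column operation on T and Q
            for r in range(m):
                p, q = W[r][k - 1], W[r][k]
                W[r][k - 1] = c * p - s * q
                W[r][k] = s * p + c * q
    return T, Q

def back_substitute(T, e):
    m = len(T)
    y = [0.0] * m
    for r in range(m - 1, -1, -1):
        acc = e[r] - sum(T[r][c] * y[c] for c in range(r + 1, m))
        y[r] = acc / T[r][r]
    return y

def solve_with_q(M, e):
    """Solve M x = e as (M Q)(Q^T x) = e followed by x = Q (Q^T x)."""
    T, Q = triangularize_from_right(M)
    y = back_substitute(T, e)             # y plays the role of Q^T x
    return [sum(Q[r][c] * y[c] for c in range(len(y))) for r in range(len(y))]

M = [[5.0, 1.0, 2.0, 1.0],
     [1.0, 5.0, 1.0, 2.0],
     [0.0, 2.0, 6.0, 1.0],
     [0.0, 0.0, 1.0, 7.0]]
x_true = [1.0, -2.0, 3.0, 0.5]
e = [sum(M[r][c] * x_true[c] for c in range(4)) for r in range(4)]
print(solve_with_q(M, e))
```

For the Solve_2 operation the same idea applies to $P^T \mathcal{M} P$, with a
second sweep of rotations for the second subdiagonal.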



4 Systolic Implementations for the General Case 

The basic stage of the systolic computation is either obtaining one column of
matrix $X$, $x_i$, when $d_{i-1,i} = 0$ (SGH step), or obtaining two columns of
matrix $X$, $x_i$ and $x_{i-1}$, when $d_{i-1,i} \neq 0$ (SGG step).

Figure 6 shows how to combine the two basic modules to solve the SGH step. In
addition to the Module QR and the Module Solve/GAXPY, a special cell called
GAXPY_2 is needed to complete the calculation of $w$ and $v$, accumulating on
them the corresponding products with the subdiagonal elements of $AQ$ and $CQ$
(which have the same zero structure as matrix $A$). An array of SAXPY cells,
each able to perform a Saxpy operation, is also necessary to update each
column of matrix $E$. This update uses the value of vector $Q^T x_i$. The
figure does not show the calculation of $x_i$ from this value, but it can be
performed on the same array, introducing only the Identity matrix, the
corresponding rotations and the vector $Q^T x_i$.



GAXPY_2 (inputs E1, E2, E3, N1, N2):

    O1 := E2 + N2*E1;
    O2 := E3 + N1*E1

SAXPY (inputs E1, E2, N1, N2, N3):

    S := N3 - E1*N2 - E2*N1;
    O1 := E1; O2 := E2



Fig. 6. Combination of the two basic modules to solve the SGH step, with the
definition of the GAXPY_2 and SAXPY cells.

The computation of the SGG step consists of the successive transformation
(fig. 7) of the $2m \times 2m$ matrix $P^T \mathcal{M} P$. To nullify the
second subdiagonal, we consider the matrix Aux, formed only by the first
$(2m - 1)$ columns of the original matrix; that is, a Hessenberg matrix of
size $2m \times (2m - 1)$. When the subdiagonal has been nullified, the matrix
Aux1, also of size $2m \times (2m - 1)$, is obtained. To form the matrix Aux2,
it is necessary to add the last column of the initial matrix. Again, a
Hessenberg matrix, of size $2m \times 2m$, is obtained, and after the process
Aux3, which is upper triangular, is obtained.

Figure 8 shows the complete process and the order in which each of these
auxiliary matrices is processed. Note that the modules are of size $m$, so the
process involves applying the DBT [14] to these matrices. The DBT of the
operation Calculate Q is discussed in more detail in subsection 4.1, but note
that, in figure 7, the matrices are cut into blocks of the size of the arrays
in a special way, making two blocks share a column.

Fig. 7. Successive transformation of the matrix $P^T \mathcal{M} P$:
(a) initial matrix, (b) matrix Aux, (c) matrix Aux1, (d) matrix Aux2,
(e) matrix Aux3.

Fig. 8. Successive steps in the calculation of $x_i$ and $x_{i-1}$ and the
Update of matrix $E$. (Note: References to blocks of the Identity and Q
matrices really refer to blocks of matrices $P^T \mathcal{I} P$ and
$(P^T \mathcal{I} P) Q$ respectively.)



4.1 Size-Independent Systolic Implementation

Let us suppose that the block system of figure 1 is made up of $N \times N$
upper Schur blocks $A b_{ij} + C d_{ij}$, of size $M \times M$, and that each
block is built of $q \times q$ blocks of dimension $m \times m$, with
$N = pn$ and $M = qm$. Let us also suppose that each of the columns of $X$ and
$E$ is built of $q$ blocks of size $m$. According to this block structure, we
will identify the subblock at row $r$ and column $s$ of the
$(A b_{ij} + C d_{ij})$ block with the notation
$(A^{rs} b_{ij} + C^{rs} d_{ij})$, and a subvector of a column $x_i$ of $X$ or
$e_j$ of $E$ will be written with the same superscript convention, for example
$x_i^s$ or $e_j^r$. This block division will be used to develop a block
oriented process to solve the Generalized Sylvester Equation; the described
situation allows the decomposition of the operations Solve, Gaxpy and Update
to process blocks of size $m \times m$. To decompose the operation Calculate Q
(and Apply Q) it is necessary to realize that there can exist subdiagonal
elements in the matrix $(A b_{ii} + C d_{ii})$ that do not belong to any
block. In order to nullify them, the block division for this operation is
similar to the one depicted in figure 9: two consecutive blocks in a row,
$(A^{r,s} b_{ii} + C^{r,s} d_{ii})$ and
$(A^{r,s+1} b_{ii} + C^{r,s+1} d_{ii})$, share a column, in such a way that we
can calculate and apply the corresponding rotations.




Fig. 9. (a) Block division for the Solve and Gaxpy operations, (b) Block division 
for Calculate Q and Apply Q operations. 



Also, to perform the Update operation, the following block division for the
matrix $E$ and the $i$-th row of matrices $B$ and $D$ must be considered:

$$
E = \begin{pmatrix}
E_{11} & E_{12} & \cdots & E_{1p} \\
E_{21} & E_{22} & \cdots & E_{2p} \\
\vdots & \vdots &        & \vdots \\
E_{q1} & E_{q2} & \cdots & E_{qp}
\end{pmatrix}, \quad
\begin{array}{l}
b_i = (b_{i1}\ b_{i2}\ \cdots\ b_{iK}\ b_{i,K+1}) \\
d_i = (d_{i1}\ d_{i2}\ \cdots\ d_{iK}\ d_{i,K+1})
\end{array}
$$

Let $K = (i-1)$ DIV $n$ and $L = (i-1)$ MOD $n$. Each block $E_{ij}$ is of
size $(m + 1) \times n$, and shares a row with the corresponding block
$E_{i+1,j}$. Each subblock of the $i$-th row of $B$ and $D$ has $n$ elements,
except for the subblocks $b_{i,K+1}$ and $d_{i,K+1}$, which have $L + 1$.

To solve the problem in the size-independent case, the Dense-to-Banded
Transformation, DBT [14], has to be applied to the non-triangular submatrices
involved in the process. The DBT obtains, from a matrix of size $m \times m$,
another one of size $m \times 2m$ or $2m \times m$, but with bandwidth $m$, by
suitably juxtaposing the upper and lower triangles of the matrix. In the
present problem it is necessary to find a DBT common to all the operations, so
the second possibility must be chosen.

As in the size-dependent algorithm, the basic stage differs depending on
whether $d_{i-1,i}$ is zero or not. When $d_{i-1,i} = 0$, the basic stage is
the calculation of $x_i^s$, shown in figure 10. This process is divided into
two steps. First, $(Q^s)^T x_i^s$ is obtained. Then, two different operations
on different data are required: the Apply Q and Gaxpy operations to preprocess
the $w$ and $v$ vectors for later stages, and the update of $E$ with regard to
the calculated value. It is assumed that while $(Q^s)^T x_i^s$ is being
obtained the control signal is kept high in the QR and Solve/GAXPY modules;
afterwards it goes low to start the preprocessing, which proceeds
simultaneously with the update on the array of $n$ SAXPY cells. The Update
operation involves the first $L$ subcolumns of block $E_{s,K+1}$ and the first
$K$ blocks (from $E_{sK}$ to $E_{s1}$). During this operation, the $O(n)$
array has to receive as inputs the required $K$ copies of $w^s$ and $v^s$ to
complete the calculation. To do that, we can use the GAXPY_2 cell: depending
on the value of a control signal (independent of the signal managing the
CALCULATE_Q and SOLVE cells) it selects inputs to the GAXPY array from the
SOLVE cell or from memory.




Fig. 10. Data flow for solving $x_i^s$, $m = 4$, $n = 3$.

When $d_{i-1,i} \neq 0$ and the Solve_2 operation must be block oriented, the
matrix $\mathcal{M}$ must also be divided into blocks; the notation to be used
will be:

$$
\mathcal{M}^{rs} = \begin{pmatrix}
A^{rs} b_{i-1,i-1} + C^{rs} d_{i-1,i-1} & A^{rs} b_{i,i-1} + C^{rs} d_{i,i-1} \\
C^{rs} d_{i-1,i} & A^{rs} b_{ii} + C^{rs} d_{ii}
\end{pmatrix} \qquad (4)
$$

In this case the basic stage will obtain $x_{i-1}^s$ and $x_i^s$. Blocks are
introduced in the order suggested by figure 10, but taking into account that
each diagonal block is processed as shown in figure 8 and each dense block as
shown in figure 11. Once $x_i$ and $x_{i-1}$ have been obtained, blocks of
matrices $A$ and $C$ are introduced into the array in the order shown by
figure 10 to complete the update of blocks $E_{sK}, \ldots, E_{s1}$ while
obtaining $w1^s$, $w2^s$, $v1^s$ and $v2^s$.

This theoretical scheme could be optimized in the systolic implementation by
overlapping stages, taking advantage of the 2-slow data flow as well as of the
existence of operations without data dependencies (for instance, in Solve_2
during the Update of matrix $E$ or, if two consecutive Solve_2 operations have
to be applied, the Update part of the first can be delayed until the beginning
of the second, increasing the efficiency of the SAXPY cells).

5 Conclusions and Future Work 

We have shown how the Generalized Sylvester Equation and its derived equations
can be systematically solved using systolic blocks that perform basic
operations of linear algebra and that form a complete Systolic Library. This
method of solving these equations has been obtained by means of a new design
methodology. Its main advantage is the modularity of the solution obtained,
which allows the same design principles used in software development to be
applied. The methodology has been applied to other equations derived from the
Generalized Sylvester Equation, in the cases shown and in the case of $A$
being a Hessenberg matrix [10], and all of them can be solved with the basic
arrays described in this paper. These results have been used to design a
complete Systolic Library [11] with the capability of solving a wide variety
of problems in the field of matrix algebra.

The work is being further extended in three different directions: the
identification of other fields in which to apply the same design methodology,
the implementation of the Systolic Library in FPGA devices, and the automation
of the process to directly obtain the FPGA configuration from the high-level
specification of the problem.



Chapter 4:
Imaging



Introduction 



Michael Duff, in his talk Thirty Years of Parallel Image Processing, presents
his views on the history of parallel image processing, providing at the same
time an introduction to the three papers on this same subject that make up the
present chapter.

The work by Jorge Barbosa et al., the recipient of the First Prize for the
Best Student Paper Award, discusses a parallel system for image processing on
a cluster of personal computers with applications in medicine.

Sousa and Sinnen are concerned with a study on the design and analysis
of a nonlocal image processing parallel algorithm for orthogonal multiprocessor
systems; the algorithm is applied to typical nonlocal image processing tasks,
such as image rotation and the Hough transformation.

The last paper, by Dantas et al., presents a track reconstruction algorithm
for experiments in high-energy particle colliders.



Abstract. The history of the development of parallel computation
methodology is closely linked with the development of techniques for the 
computer processing of images. In the early 60s, research in high energy 
particle physics began to generate extremely large numbers of particle track 
photographs to be analysed and attempts were made to devise automatic or 
semiautomatic systems to carry out the analysis. This stimulated the search for 
ways to build computers of increasingly higher performance since the size of 
the image data sets exceeded any which had previously been processed. At the 
same time, interest was growing in exploring the structure of the human visual 
system and it was felt intuitively that image processing computation should bear 
at least some resemblance to its human analogue. 

This review paper traces the simultaneous progress in these two related lines of 
research and discusses how their interaction influenced the design of many 
parallel processing computers and their associated algorithms. 



1. Thirty Years Ago 

Image Processing was originally regarded as a subset of the wider field of Pattern 
Recognition which dealt with the analysis and processing of patterns in sound and 
other signal sources such as ECG and EEG as well as images. In all these areas, the 
research was mainly application driven. A three-day meeting in London in 1968, 
organised by the Institution of Electrical Engineers and entitled ‘Conference on 
Pattern Recognition’, comprised 37 papers. Of these, approximately one third were 
devoted to Optical Character Recognition (OCR) and a quarter to the physiology or 
psychology of human vision; the remainder was distributed more or less equally 
between studies of learning algorithms, speech recognition and general problems in 
pattern recognition. At this early stage, although it was realised that the principal 
application, OCR, would eventually demand much higher processing power than was 
currently available, the lack of effective algorithms meant that research was directed 
towards how to recognise images rather than to doing so at economic speeds. 

Even so, what was not realised was how difficult the task would be. There was a 
quite unjustifiable optimism amongst researchers which could probably be excused by 
the fact that everyone could observe in action (and, in fact, owned) a very effective 
image processing system which was portable, low power, high resolution and able to 
work in an unconstrained environment. Colour analysis, stereoscopy, time sequence 
analysis, automatic compensation for high or low light conditions, rotation invariance, 
image fragment recognition, learning capability: the system coped with all these 
difficult aspects. Unfortunately, it was considered that a combination of intuition and 
introspection would somehow reveal how the human vision system was constructed 
and that this knowledge could then be translated into an appropriate combination of 
hardware and software. This would amount to more than a PhD project but certainly 
should not take as long as ten years to complete. 

In this optimistic atmosphere, there were two factors which stimulated an interest in 
faster computation. First, it seemed likely that useful algorithms would soon be 
developed and that computers would then need to be made much more powerful in 
order to achieve acceptable processing rates. Second, the progress being made in 
designing algorithms was poor, at least in part due to the inefficient computing 
services currently available. For example, at University College London in the early 
60s, a large mainframe machine (IBM 360) provided the central computing service. 
Programs and even test images were entered via punched cards and then batch 
processed. Typically, a print-out of the results, using overprinted characters to 
represent image intensities, would be obtained on the following day; any small 
programming error (such as an unwanted comma) added a further day's delay to the 
program development time. In this virtually non-interactive environment, thinking 
constructively about algorithm design was almost painful. 

Optimistic or not, almost all who were engaged in image processing research agreed 
that faster computers would, sooner or later, need to be developed and that there 
would be an immediate advantage if computing speeds could be improved. The 
important question was: how could a speed gain be achieved? 



2. Faster Computing 

From the outset, it was clear that there were only three ways to speed up computing. 
They were (and still are): 

a) More efficient programming; 

b) Use of faster components; 

c) Improved system hardware architecture. 

With large data sets to be processed, it is extremely important to optimise the pieces 
of code in the so-called inner loops. For example, if the intensity of every pixel in an 
image is to be averaged with its neighbours, then the code performing the averaging 
may be executed a million times in a typical size image. Any wasted operations in 
that section of code will severely affect the overall efficiency of the program. It goes 
without saying that experienced programmers would not be expected to make this sort 
of error. In general, it would be hoped that most of the gains which could be obtained 
by efficient programming would normally already have been made. 

Speeding up computers by using faster components is a continuous process of 
technological development which is largely under the control of computer 
manufacturers. In the period we are discussing, computing component technology 
moved from thermionic valves, through transistors to integrated circuits, having 
already progressed from mechanical (gear wheels) and electromechanical (relay) 
computation. In the last phase, integrated circuits have also undergone massive 
improvements in level of integration (numbers of components per unit area) and 
semiconductor technology, both of which have produced enormous speed gains. For 
the typical researcher, access to the best available circuit components has usually been 
a matter of cost since all new devices tend to be prohibitively expensive when first 
introduced. 

The third approach is to redesign the computer architecture. The underlying structure 
of all computers was once much the same: there was a store for instructions, a store 
for data and a processor which was controlled by instructions extracted from the 
program store. These acted on data from the data store, producing a result which was 
returned to the data store. There were also units which input and output data and 
programs. A master controller ensured that all these operations were correctly 
sequenced. This extreme oversimplification hides all the ingenuity which went into 
making these basic operations efficient and transparent to the programmer. 

Starting with this fundamentally simple architecture, the challenge was to make 
changes which would improve performance not marginally but substantially, ideally 
by many orders of magnitude. This was the impetus behind the introduction of 
Parallel Processing. 



3. The Concept of Parallel Processing 

"Many hands make light work" is a well known saying, but then so is "Too many cooks 
spoil the broth". The fact is that increasing the size of the work force does not 
necessarily reduce the time (or cost) for completing a task. The introduction of 
additional labour implies a degree of organisation and co-ordination and may also 
require the task to be split up into manageable portions. The overhead for 
organisation can be more than the time saved and the task may not respond well to 
division. How often does one hear the comment: "I don't think you can help me; it 
will be quicker if I do it myself!"? 

The central challenge in the design of parallel computers is to assemble many 
computers (or processors) into a system which will then share the execution of a 
program in such a way that the time between the start and end of the whole process is 
reduced. Ideally, if N computers are used to execute a program then the execution 
time $T_N$ should be $(1/N)\,T_1$, where $T_1$ is the time taken by a single computer to 
execute the same program (suitably rewritten for a single computer). In practice, this 
ideal is seldom achieved, the exception being in computers designed for specific 
algorithms. A crude measure of efficiency of a parallel architecture is $T_1/(N\,T_N)$, 
but, as will be discussed in more detail later, this measure will depend on the program 
being executed, both in relation to the task being performed and to the skill of the 
programmer. 
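
The two quantities just defined can be made concrete with a short Python sketch. The function names and the timings used in the comments are illustrative assumptions, not measurements from any machine discussed in this review.

```python
def speedup(t1: float, tn: float) -> float:
    """Speed gain: single-processor time T_1 divided by N-processor time T_N."""
    return t1 / tn


def efficiency(t1: float, tn: float, n: int) -> float:
    """Crude efficiency measure from the text: T_1 / (N * T_N)."""
    return t1 / (n * tn)


# Hypothetical timings: a program taking 64 s on one processor.
# Ideal case: 64 processors finish in 1 s, so efficiency(64.0, 1.0, 64) is 1.0.
# Less ideal: 64 processors finish in 2 s, so efficiency(64.0, 2.0, 64) is 0.5.
```

An efficiency of 1.0 corresponds to the ideal $T_N = T_1/N$; anything lower indicates time lost to organisation, communication or idle processors.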



4. Classifying Parallel Architectures 

In general, a parallel computer will consist of an assembly of simple computers, 
usually referred to as processing elements (PEs). Each PE may be extremely simple, 
perhaps only capable of processing single bit data, but might alternatively be 
complex, such as a PC. There will usually be memory assigned to each PE and an 
interconnection network, both for transmitting data between PEs and for supplying 
instructions to the PEs. Some systems operate under the control of one master 
computer whereas others assign partial or even total autonomy to each PE. 

In the past three decades, much has been written about the many different 
architectures of parallel processing computers and many attempts have been made to 
devise a taxonomy for classifying the architectures (e.g., see [9]). The best known 
attempt was by M J Flynn [8] whose classification was based on whether the data 
stream was single or multiple and on whether the instruction stream was single or 
multiple. Of the four possible classes, the one that most aptly fitted a representative 
group of parallel processing computers (several of which were actually constructed) 
was the SIMD class: an array of simple PEs all simultaneously executing the same 
instruction (Single Instruction stream), but each operating on its own part of the data 
(Multiple Data stream). However, despite the fact that the paper describing this 
taxonomy has been quoted in the literature more than has any other on this topic, this 
division of parallel processors into four classes is so crude as to be virtually useless. 
Many parallel systems either do not fall convincingly into any of the classes or else 
equally well fall into more than one. Furthermore, the first class (Single Instruction 
stream, Single Data stream) refers to serial computing so can hardly be treated as part 
of the taxonomy. 

It is therefore not unreasonable to ask why researchers persist in attempting to devise 
classification schemes. There are probably two main reasons: 

Divide and conquer: Computer scientists (and others) have experienced great 
difficulty in understanding the underlying principles of parallel processing systems 
and it can be a help if the structure of each system is compared with one of several 
archetypes: a form of learning by analogy; 

Establishing design objectives: Parallel computer designers need to be clear what 
their strategy will be when designing a new system. It can be a useful design 
discipline to encapsulate a strategy by naming and defining the broad principles 
governing each particular design. 

For the remainder of this review, classification schemes will not be considered, 
especially as there is now little or no agreement as to which scheme should be 
adopted. 



5. Parallel Processing Fundamentals 

5.1 Three Level Processing 

It is easy to state in imprecise terms what is required of any parallel processing 
system. It is a system which, by employing more than one processor, completes a 
data processing task faster than could be achieved by a single processor. In order to 
investigate parallel architectures, the following discussion will concentrate on the 
particular problems associated with image processing. Examining the problems in 
detail, certain significant factors begin to emerge: 

Data type: Image data usually consist of large regular arrays of square picture 
elements (pixels), each of which represents the local brightness and, possibly, colour 
of the image. Typically, each pixel is assigned a 1-bit integer (black and white 
so-called binary images), an 8-bit integer (grey-level images) or a 24-bit integer (colour 
images). An image of approximately domestic television resolution (512 x 512 
pixels) comprises rather more than one quarter of a million pixels. Very many image 
processing operations involve replacing each pixel by a new pixel whose intensity is a 
function of the intensities in a defined neighbourhood, for example, the 3 x 3 pixel 
region surrounding each pixel. This implies that an image processing operation can 
involve over 2.5 million basic operations (each requiring fetching data from memory, 
computing a sum or product and then storing the result in memory). The need for fast 
processing is self evident. 

Computation type: It is clear that the highly repetitive nature of the elements 
of the image processing computation might offer potential for structuring a computer 
architecture so as to take advantage of the repetitiveness. 
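
As a rough illustration of the neighbourhood operation described above, the following Python sketch replaces each interior pixel by the mean of its 3 x 3 neighbourhood and counts the pixel fetches involved. The function names and the treatment of border pixels are assumptions made for this example only.

```python
def mean_filter_3x3(image):
    """Replace each interior pixel by the integer mean of its 3 x 3 neighbourhood.

    Border pixels are copied unchanged (an assumption; the text does not
    say how the edges of the image should be treated).
    """
    rows, cols = len(image), len(image[0])
    result = [row[:] for row in image]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            total = 0
            # Inner loops: this code runs once for every interior pixel.
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    total += image[r + dr][c + dc]
            result[r][c] = total // 9
    return result


def neighbourhood_fetches(width, height, window=3):
    """Pixel fetches needed if every pixel visits a window x window region."""
    return width * height * window * window
```

For a 512 x 512 image and a 3 x 3 window, `neighbourhood_fetches(512, 512)` gives 2,359,296 fetches, which is of the same order as the figure quoted above. The same short inner loop is repeated identically at every pixel, which is the repetitiveness a parallel architecture can exploit.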

Unfortunately, this brief analysis of image processing greatly oversimplifies the 
situation. Conventionally, the complex task of image processing is divided into three 
stages or levels [23]: 

a) Low level processing which is characterised by taking in one or more 
images, processing them and outputting one or more result images. In general, the 
dimensions of the input and output data arrays will be identical; 

b) Intermediate level processing in which the input data will be one or more 
images (input from the low level processing stage) and the output data will be one or 
more dimensionally smaller data sets, such as lists of detected object features and 
global properties of the image (e.g. average intensity, histograms, contrast range). 

c) High level processing which attempts to extract meaning from the 
intermediate level data with a view to describing and analysing the input image. The 
output data might be as small as a single word or sentence. 
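
The way the data shrink from level to level can be sketched in a few lines of Python. The three functions below, their names, the threshold and the decision rule are all hypothetical choices made for illustration; real systems perform far more elaborate processing at each level.

```python
def low_level(image, threshold=128):
    """Image in, image out: same dimensions, here a simple binarisation."""
    return [[1 if pixel >= threshold else 0 for pixel in row] for row in image]


def intermediate_level(binary):
    """Image in, small data set out: a few global properties of the image."""
    pixels = [pixel for row in binary for pixel in row]
    return {"set_pixels": sum(pixels), "total_pixels": len(pixels)}


def high_level(features, min_fraction=0.1):
    """Small data set in, a single word out."""
    fraction = features["set_pixels"] / features["total_pixels"]
    return "object" if fraction >= min_fraction else "empty"


# Chaining the three levels on a tiny hypothetical grey-level image:
image = [[0, 200], [50, 255]]
description = high_level(intermediate_level(low_level(image)))
```

Here the low level returns an array as large as its input, the intermediate level returns two numbers, and the high level returns one word.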



5.2 Processor Arrays 

As was discussed earlier, many low level image processing tasks can be broken down 
into identical short sequences of basic operations, each centred on every pixel in the 
image. An image architecture closely matching the apparent requirements of this 
level would therefore be an array of very simple processors, each associated with a 
single pixel and each accessing data only from its own local memory or from the 
neighbouring set of pixels. The repetitive nature of the processes to be performed 
would permit broadcasting a sequence of instructions to each simple processor (PE), 
the instructions being then executed simultaneously by every PE. This is the classic 
SIMD architecture. 
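
A software simulation can make the broadcast idea concrete. In the Python sketch below, one instruction at a time is applied to every PE, and each PE sees only its own value and those of its four nearest neighbours. The names `broadcast`, `run`, `expand` and `shrink` are hypothetical, and the rule that neighbours beyond the array edge read as 0 is an assumption.

```python
def broadcast(state, instruction):
    """Apply one instruction to every PE 'simultaneously'.

    `state` is a list of rows holding one value per PE. All PEs read from
    the state as it was before the step, which models simultaneous execution.
    Neighbours beyond the array edge read as 0 (an assumption).
    """
    rows, cols = len(state), len(state[0])

    def at(r, c):
        return state[r][c] if 0 <= r < rows and 0 <= c < cols else 0

    return [
        [
            instruction(at(r, c), at(r - 1, c), at(r + 1, c), at(r, c + 1), at(r, c - 1))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def run(state, program):
    """Broadcast each instruction of the program in turn to the whole array."""
    for instruction in program:
        state = broadcast(state, instruction)
    return state


# Two illustrative instructions for binary images (arguments: own value,
# then the north, south, east and west neighbours).
def expand(own, north, south, east, west):
    return 1 if (own or north or south or east or west) else 0


def shrink(own, north, south, east, west):
    return 1 if (own and north and south and east and west) else 0
```

In this sketch a single set pixel grows into a five-pixel cross after `expand`, and `shrink` then reduces the cross to the original pixel. In hardware the loop over rows and columns disappears: every PE executes the instruction in the same cycle.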




Fig. 1. A 4 x 4 PE array, showing the interconnections between PEs and the bus 
distributing instructions in parallel to each PE 



Apart from the paths taken by the instructions, all communication paths in the 
array are short (i.e. to nearest neighbours), provided that local memory is associated 
with every PE. One further set of longer paths is needed to input or output data to the 
memory array but these could be routed along the instruction highway. Fig. 1 
illustrates the main features of a 4 x 4 PE array. 

Architectures of this type would appear to be ideal for low level processing but 
present many difficult problems in software design. Nevertheless, it can be shown 
that arrays of very simple PEs are theoretically capable of performing all image 
processing operations (even including those classified as intermediate or high level, 
although these might not be executed very efficiently). 
















One or more loosely coupled conventional processors can efficiently handle high 
level processing. There is no general pattern to the type of operations to be performed 
nor to the various types of input data set and the fastest available high speed 
workstation or even PC would usually offer the best solution. The same computer 
would probably be used to control the other two levels of the composite system. 

The most difficult stage to implement is the intermediate level. By definition, the 
input data impose requirements similar to those for the low level but the need to 
abstract information derived from all parts of the image (or images) implies the need 
for efficient connection paths across the whole of the image array. It would also seem 
likely that an array of simple PEs would not represent an ideal structure for 
computing histograms and other results contained in comparatively small data sets. 
Optimisation is therefore difficult and likely to be specific task dependent. 

A further problem resulting from the splitting of the low and intermediate levels is the 
difficulty in transferring the multiple image data between the two levels. Unless this 
can be achieved using many parallel paths, ideally one for each pixel, then this 
process might prove to be the bottleneck for the whole system. 

Taking these two factors into consideration, there would seem to be good arguments 
for recombining the low and intermediate levels, enhancing the low level structure by 
adding good communication paths between all parts of the array of PEs. 

In summary, the final assembly would comprise just two levels: the low/intermediate 
level would be an array of PEs, one per pixel for the size of image to be processed, 
and the high level/controller would be a conventional workstation or high 
performance PC. 



5.3 Pipeline Processors 

In the discussion in the previous section it was tacitly assumed that the task presented 
was to process a single image. Parallelism was achieved by assigning PEs to each 
part of the image data (i.e. to each pixel). An alternative approach can be adopted 
when many images are to be processed in a sequence. Under these circumstances, 
each processor is given a particular operation to perform and the sequence of images 
is fed through a string of processors, the output of one providing the input for the 
next. The processors thus constitute a pipeline and the parallelism is now function 
parallelism rather than data parallelism (as was employed in the processor array). 
Sternberg has built and marketed several pipeline processors (named Cytocomputers) 
and developed complex software to program them [22]. 

In passing, it is interesting to note that this type of computer might also be classified 
as SIMD in that each PE executes a single instruction on multiple data, although in 
this case the data is multiple in time rather than position. In that the Flynn system of 
classification appears not to distinguish between these two very different 
architectures, it would seem to be of little practical use. 



Fig. 2. A short pipeline processor with 4 PEs (labelled Function 1 to Function 4 in 
the figure) and a master controller 

Because the operations each PE performs on the image as it passes through it can be 
quite complex, a pipeline PE will usually be much more powerful than those utilised 
in processor arrays. A further consideration is that cost and program structure 
combine to make it unprofitable to construct very long pipelines; instead, it is more 
efficient to cycle each stream of images several times through the pipeline, 
reprogramming the PEs to perform new operations after each pass. Whether or not 
this is done, there is always the disadvantage that the so-called latency of the pipeline 
(the time delay between an image entering the first PE in the chain and the time it 
leaves the last PE) may be inconveniently long. For example, although a 100 PE 
pipeline might output fully processed images at a rate of 10 per second, the latency in 
the chain would be 10 seconds, thus ruling out such a system for real-time processing 
as might be required in a 'visually' controlled machine. 
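
The latency figure follows directly from the pipeline structure, as the Python sketch below shows. It assumes that every PE holds an image for exactly one output period; the function names and the tick-based simulation are illustrative and do not describe any particular machine.

```python
def pipeline_latency(stages: int, images_per_second: float) -> float:
    """Latency in seconds when each stage holds an image for one output period."""
    return stages / images_per_second


_EMPTY = object()  # marks a PE that currently holds no image


def run_pipeline(functions, images):
    """Feed images through a chain of PEs, one step per clock tick.

    `functions` must contain at least one stage. Returns a list of
    (tick, result) pairs showing when each processed image leaves the last PE.
    """
    slots = [_EMPTY] * len(functions)
    pending = list(images)
    outputs = []
    tick = 0
    while pending or any(slot is not _EMPTY for slot in slots):
        tick += 1
        new_slots = [_EMPTY] * len(functions)
        # Each PE works on what its predecessor produced in the previous tick.
        for i in range(len(functions) - 1, 0, -1):
            if slots[i - 1] is not _EMPTY:
                new_slots[i] = functions[i](slots[i - 1])
        if pending:
            new_slots[0] = functions[0](pending.pop(0))
        slots = new_slots
        if slots[-1] is not _EMPTY:
            outputs.append((tick, slots[-1]))
    return outputs
```

`pipeline_latency(100, 10)` returns 10.0, the ten-second delay quoted above. In `run_pipeline`, the first image leaves a four-stage chain only at tick 4, although one result then emerges on every subsequent tick: throughput is high but each individual image is delayed by the full length of the chain.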

Other disadvantages are the difficulty in feeding forward partially processed images 
(to be used in combination later in the chain) and the virtual impossibility of handling 
feedback (when the parameters of the early stages of processing have to be adapted to 
the results of later stages). 



5.4 MIMD Arrays 

A third approach to parallel image processing makes use of a relatively small set of 
loosely coupled, powerful PEs, each capable of independent operation. A typical 
number would be 64 or less and the PE might be a microprocessor or even a PC. In 
principle, the image processing task is shared between all the PEs which then 
communicate over a high speed bus or some more complicated network. Each PE 
will have its own program store and substantial local memory whereas the system as a 
whole will usually be arranged so that one PE acts as a master controller and a major 
block of memory can be accessed by all the PEs. The classification Multiple 
Instruction stream, Multiple Data stream is clearly applicable since each PE executes 
its own program on its own part of the data. 



Fig. 3. Simple MIMD system with three PEs (each with local memory) and a master controller, 
together with a common memory block (the figure labels a Data Bus and an Instruction Bus) 



MIMD systems have not made much impact on image processing. Just as employing 
more staff will not necessarily get a job done more quickly, so it has been found that 
adding more PEs to an MIMD system does not always result in faster processing 
speeds. Indeed, the additional overhead resulting both from subdividing the task and 
from communicating between the PEs can even result in a reduction of performance 
as more PEs are incorporated. The most serious objection to MIMD systems is that 
they are very difficult to program. Compilers which will efficiently segment the 
processing task into blocks, which will not leave PEs idle for much of the time, rarely 
exist and, in any case, seldom achieve efficiency over a range of different 
applications. It is therefore left to the programmer to decide how to employ the 
parallelism and this will imply that the programmer must know much more about the 
structure of the hardware system than is normal for software designers. 
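
The claim that extra PEs can slow a system down is easy to reproduce with a toy cost model. The model below is an assumption introduced purely for illustration and is not taken from the text: the work divides perfectly between the PEs, while the coordination overhead grows linearly with their number.

```python
def mimd_time(t1: float, n: int, overhead_per_pe: float) -> float:
    """Toy model (an assumption): perfectly divisible work plus a
    coordination cost that grows linearly with the number of PEs."""
    return t1 / n + overhead_per_pe * (n - 1)


def best_pe_count(t1: float, overhead_per_pe: float, max_n: int) -> int:
    """Number of PEs, up to max_n, that minimises the modelled run time."""
    return min(range(1, max_n + 1), key=lambda n: mimd_time(t1, n, overhead_per_pe))
```

With a hypothetical 64-second task and one second of overhead per additional PE, the modelled time falls to 15 seconds at 8 PEs, rises again to 19 seconds at 16 PEs, and at 64 PEs is back to the 64 seconds taken by a single processor. Real overheads depend on the task and on the interconnection network, so the shape of the curve, rather than the numbers, is the point.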



5.5 Special Purpose Devices 

Faced with apparently insuperable difficulties in producing fast, efficient, general 
purpose image processing computers, some designers have tackled the more 
achievable challenge to design special purpose circuits which perform a very limited 
range of operations. For example, in some applications, an image transformed so that 
only the edges of objects are displayed (as white lines on a black background) can be 
useful. Another application needs to isolate only those parts of an image which are 
changing, perhaps because an object in the scene is moving. 

Some of these devices combine a retina-like array of optical detectors with a matching 
array of hard-wired logic elements; others use a sequence of hard-wired processing 
units in a pipeline configuration. In today's jargon, systems such as these could be 
called smart cameras but their smartness is strictly limited and, somehow, 
disappointing. 



5.6 Summary 

There have been many approaches to parallelism in computers designed principally 
for image processing. The precise form of parallel architecture chosen is likely to 
depend on the range of tasks to be tackled. Thus, systems to be used for real-time 
control based on television cameras will almost certainly not be applicable to batch 
processing of large numbers of images collected by astronomers. Again, devices for 
motion detection would have no place in a pathology laboratory dedicated to cervical 
smear analysis. 

Parallel processing systems cannot be neatly categorised and it is doubtful whether 
there would be any value in doing so at this stage in their development. For those of 
us who have spent much of our working lives studying and designing such systems, it 
is discouraging to have to admit that the need for parallel systems in image processing 
has fallen to a low priority. The current obstacle to progress is the lack of effective 
algorithms; workstations and the latest generations of PC are usually quite fast 
enough for anything that needs to be done. 



6. Historical Background 



6.1 Pioneer Research 

Blindness is a terrible affliction. Most of the human environment is designed or has 
been adapted on the assumption that we can see and the vast majority of tasks 
performed by humans rely on human vision to provide the necessary feedback to 
control performance. Without the gift of vision, humans are greatly restricted in what 
they can do. 

In the same way, the development of sophisticated automation, especially in the 
manufacturing industry, has been retarded by the lack of competent computer vision 
systems. This is particularly serious with respect to inspection of manufactured parts 
and similar problems occur in medicine in the areas of mass screening; the subject of 
optical character recognition has already been mentioned in this review. Pure science 
would also benefit if it were possible to automate the analysis of photographic images 
produced in many research areas, high energy particle physics and astronomy being 
the earliest of these to generate this requirement. 

The inadequate performance of even the fastest available computers in the early 
1960s (when the demand for computer vision was beginning to become apparent), 
stimulated computer scientists to turn their attention to the research that was then in 
progress investigating the mechanisms underlying biological vision. Two seminal 
papers in this area were the study of frog vision by Lettvin et al. [14] and a slightly 
later paper by Hubel and Wiesel on cat vision [13]. Herscher and Kelley embodied 
the ideas behind the first paper in a hardware demonstration [12]. 

In the studies of both the frog and the cat, the anatomy of the visual system was seen 
to embody an array of photodetectors (rods and/or cones) forming the retina with the 
electrical outputs of the photodetectors being cross-connected, effecting both 
summation and lateral inhibition (i.e., a strong output from one photodetector reduces 
the strength of the output from its neighbours). The modified outputs are fed via a 
bundle of nerve fibres (the optic nerve) through to the visual cortex of the brain where 
layer upon layer of densely interconnected neurons carry out parallel logic operations 
on the retinal outputs. In the case of the frog, only a very small number of image 
properties can be extracted from the optical data, such as detection of an object 
moving into the field of view. However, the cat's visual system is very similar to the 
human's and is therefore capable of highly sophisticated scene analysis. In all these 
studies, the anatomical investigation was supplemented by physiological 
measurements and much was learnt about how the systems effected their processing. 

At approximately the same time as this work was started, Unger published the first of 
his two papers [24], [25] proposing a processor array, although he did not build an 
array himself; in fact, these papers seem to be his last contact with the subject of 
computer architecture. His papers described a theoretical, square array of simple 
logic elements, each of which could receive data from or send data to any of its four 
neighbours. He demonstrated that his array could execute simple but useful functions 
on arrays of data of the same dimensions as the logic array but he did not suggest how 
these logic elements could be implemented in hardware. Fortunately, the Unger 
papers served to inspire others who then did construct hardware based on the ideas he 
had expressed. Another pioneer was Golay [10, 11] whose processor proposals, 
although conceived as a serial device, were turned into hardware by Preston [16] who 
was well aware that a more parallel version could have been constructed. 

Computers whose designs were based loosely on Unger's ideas were, in order of 
construction, Solomon [19], ILLIAC III [17], ILLIAC IV [1], [21] and DAP [7]. It is 
not clear whether Solomon was actually constructed and made to operate but ILLIAC 
III caught fire before it could be completed and only 'worked' in simulation. ILLIAC 
IV was only partially completed but sufficient was built to enable it to carry out many 
large-scale computations. DAP started being developed in 1973, was prototyped in 
1976 and put into commercial production in 1980. The last machines in this sequence 
were MPP [2], which first appeared in 1983, and the Connection Machine [13], which 
later evolved into the commercial CM series of massively parallel processor systems. 

Parallel processing research in the Image Processing Group in the Department of 
Physics at University College London (UCL) was initially influenced by the 
biological papers listed above. The research into parallel processing followed some 
seven years of development of semi-automatic microscopes and other image analysis 
equipment (1958-1965), constructed for the three High Energy Particle Physics 
groups in the same department. Unger's paper was not seen by the UCL group until 
many years later and it was surprising to see how the two disconnected lines of 
research had by then converged. At this time, another field of research was also 
coming into being: Neural Networks. The pioneer work here was carried out by 
Rosenblatt [20] who devised the Perceptron. This circuit loosely simulated a neuron 
and introduced the idea of constructing circuits which could be trained to make 
decisions by adjusting the values of certain circuit elements (usually variable 
resistors) in response to a set of training patterns. The strengths of selected pattern 
features were translated into voltages which were then summed through the variable 
resistors (one for each feature). The resulting summed voltage was then compared 
with a threshold voltage and the pattern classified as class A (sum at or above the 
threshold) or class B (sum below the threshold). If necessary, the automatic trainer 
then adjusted the variable resistors appropriately to correct the decision and a new 
pattern was presented. It could be shown that the process would converge so that, 
ultimately, all the circuit's decisions were correct for the training set and would 
generally be correct for similar but previously unseen patterns. This work was also 
influential on the UCL programme. 
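
The training procedure just described can be sketched in Python, with numerical weights standing in for the variable resistors. The text does not say exactly how the trainer adjusted the resistors, so the standard perceptron update rule is assumed here; the function names, threshold, step size and training patterns are likewise illustrative.

```python
def classify(weights, threshold, features):
    """Class 'A' if the weighted sum is at or above the threshold, else 'B'."""
    total = sum(w * x for w, x in zip(weights, features))
    return "A" if total >= threshold else "B"


def train(patterns, n_features, threshold=0.5, step=0.125, max_epochs=100):
    """Adjust the weights until every training pattern is classified
    correctly, or until max_epochs passes have been made.

    The update rule (add or subtract a fraction of the feature vector after
    each wrong decision) is the standard perceptron rule, assumed here.
    """
    weights = [0.0] * n_features
    for _ in range(max_epochs):
        errors = 0
        for features, label in patterns:
            if classify(weights, threshold, features) != label:
                errors += 1
                sign = 1.0 if label == "A" else -1.0
                weights = [w + sign * step * x for w, x in zip(weights, features)]
        if errors == 0:
            break
    return weights


# Hypothetical training set: class A whenever the first feature is present.
patterns = [([1, 0], "A"), ([0, 1], "B"), ([1, 1], "A"), ([0, 0], "B")]
weights = train(patterns, n_features=2)
```

For this small, linearly separable training set the loop stops after a few passes with every pattern classified correctly. Convergence is only guaranteed when the classes can be separated in this way.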



6.2 Research at UCL 



UCPR1 

A research grant application written in 1965 to request support for the UCL research 
programme is of interest. It could be submitted even today with very little 
modification since it addresses problems which are relevant to the design of parallel 
processing systems and are still unsolved: 

'One of the main limitations on the design of neuron-like networks 
has been the prohibitive cost of constructing circuits which involve 
very large numbers of circuit elements together with a high degree 
of interconnection between the elements. If these limitations were 
to be removed by exploiting some of the relatively new techniques 
for the production of microminiature circuits, then it might prove 
possible to develop networks which would embody some of the 
considerable analytical facilities of neural nets. In addition, the 
increased component density would permit a measure of 
redundancy, and local failure would not impair efficiency of the net. 
This, in its turn, would allow the use of circuit construction 
techniques which do not produce component values within close 
tolerances. 



The last part of the proposed programme which has been 
envisaged would comprise: 




a) The construction of transistor models of neural elements with a 
view to producing a critical survey of their properties and to 
designing improved elements; 

b) Assembly of such elements into various arrays, exploring the 
numerous modes of interconnection; 

c) Simulation of such networks by means of computer programs, and 
developing appropriate mathematical methods to handle the logical 
circuit analysis; 

d) Translation of the circuitry into microminiature components, and 
utilisation of circuit replication techniques 

e) 



Application to the UK Department of Scientific and Industrial 
Research for support for a research programme entitled ‘Pattern 
Recognition Matrices’, dated March 1965' 



This programme resulted in the construction and demonstration in September 1967 of
UCPR1 [4]. Integrated circuits were not generally available at this time so the active
components in UCPR1 were diodes and transistors. Regions of interest in
photographs of the tracks of high energy charged particles (in nuclear emulsions and
from cloud chambers and bubble chambers) are characterised by either a sharp change
in direction or by a branching of the track into two or more components. Automated
scanning equipment had been built which needed manual centring on these regions so
UCPR1 was designed to show the possibility of making a retina-like device which
would detect the regions automatically.

The input to the system was a square array of 256 photodiodes onto which the track 
photograph was projected. Hard-wired circuits were layered under the photodiodes 
and performed the following functions: 

1. Amplification;

2. Summation over a 3 x 3 window surrounding each photodiode;

3. Non-linear amplification of the summed output, saturated by at least two out of
the nine possible inputs;

4. Summation over the outer edge of the 5 x 5 window centred on each amplifier
output;

5. Comparison of the final output with a variable threshold, scanned from a high
value downwards and designed to locate the maximum summed output;

6. The threshold scanner stopped scanning as soon as a maximum was detected and
the final layer outputs were fed to a 256 x 256 array of light bulbs. The bulb or
bulbs which lit indicated the position of the detected region of interest (referred
to as a vertex). The variable threshold scanned at 50 Hz so vertices could be
detected in real-time, i.e., once every 20 msec.
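
The layered functions listed above can be mimicked in software. The following is a minimal Python sketch under stated assumptions, not a model of the original analogue circuitry: the input is taken to be a binary image, cells outside the array count as zero, and the names `window_sum`, `detect_vertices` and `saturation` are hypothetical.

```python
def window_sum(img, radius, ring_only=False):
    """Sum values in the (2*radius+1)-square window centred on each cell.

    With ring_only=True only the outer edge of the window is summed.
    Cells outside the array count as zero (an assumption of this sketch).
    """
    n = len(img)
    out = [[0] * n for _ in range(n)]
    for y in range(n):
        for x in range(n):
            total = 0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    if ring_only and max(abs(dy), abs(dx)) != radius:
                        continue
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < n and 0 <= xx < n:
                        total += img[yy][xx]
            out[y][x] = total
    return out


def detect_vertices(img, saturation=2):
    """Return the cells whose final-layer output is maximal."""
    summed = window_sum(img, 1)                     # 3 x 3 summation
    clipped = [[min(v, saturation) for v in row]    # saturating amplifier
               for row in summed]
    ring = window_sum(clipped, 2, ring_only=True)   # outer edge of 5 x 5 window
    # Lower the threshold from its highest possible value (16 ring cells,
    # each at most `saturation`) and stop at the first response.
    for threshold in range(16 * saturation, 0, -1):
        hits = [(y, x) for y, row in enumerate(ring)
                for x, v in enumerate(row) if v >= threshold]
        if hits:
            return hits
    return []


# Two tracks crossing at (8, 8) on a 16 x 16 array.
image = [[1 if (y == 8 or x == 8) else 0 for x in range(16)] for y in range(16)]
vertices = detect_vertices(image)
```

In this toy example the strongest responses cluster in the cells immediately around the crossing point, which is the behaviour the threshold scan is meant to expose.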

Fig. 4. A working demonstration of the parallel processor UCPR1, as demonstrated at the
Physical Society Exhibition in London in 1967. The lamp at the top illuminates a track
chamber photograph placed over an array of photodiodes. The electric lamp array to the right
shows the location of vertices in the photograph.



The Diode Array 

UCPR1 achieved what it set out to do: it successfully detected and located vertices in
charged particle track photographs. A small piece of extra hardware showed that it
could also be used to detect ends of lines and a further extension enabled UCPR1 to
recognise carefully drawn alphanumerics (but not the complete alphabet) by analysing
the locations of ends and vertices. The obvious weakness of the UCPR1 concept was
that each layer of processors could only execute a single logic function. In effect,
UCPR1 was unprogrammable.

The Diode Array project [5] was the first attempt to determine what was the simplest 
specification for a processing element (PE) that would enable it to be programmed to 
perform all possible functions on arrays of data. Consideration of the experience 
gained in studying UCPR1 and also taking into account what was then known about
the construction of the mammalian retina, led to the proposal that each PE should be
able to input, store and output single-bit data, should be capable of inverting data and, 
finally, should be connected to neighbours in such a way that data from neighbours 
could be input as a logical OR of all four inputs. 

Fig. 5. A single PE of the Diode Array, showing a neon indicator (ON or OFF for one or zero
outputs) and a double-pole, double-throw switch to allow zero or one to be entered.



The basic PE is shown in figure 5. It includes a neon bulb which glowed to show a 1 
output (dark for a 0 output) and the points labelled A to J and + were initially left 
unconnected. The double pole switch was used to input a 1 or a zero (corresponding 
to its ON and OFF positions). A small 5 x 5 array was constructed and additional
electromechanical relay circuits added to enable the user to systematically connect 
together various combinations of the labelled circuit points, the same combination in 
each PE. Treating the switch state as representing black and white image data, it 
could be demonstrated that functions such as image inversion, object edge extraction 
and object expansion and shrinking could be effected. 
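
How a neighbour input formed as a logical OR leads to these image functions can be illustrated in a few lines. The sketch below is an assumption-laden software analogue, not the relay wiring of the Diode Array: it treats the image as a list of rows of 0/1 values, takes inputs from beyond the array border as zero, and uses hypothetical function names.

```python
def neighbour_or(img):
    """OR of the four edge-connected neighbours of each PE (zero beyond the border)."""
    h, w = len(img), len(img[0])

    def at(y, x):
        return img[y][x] if 0 <= y < h and 0 <= x < w else 0

    return [[at(y - 1, x) | at(y + 1, x) | at(y, x - 1) | at(y, x + 1)
             for x in range(w)] for y in range(h)]


def invert(img):
    """Image inversion: swap black and white."""
    return [[1 - v for v in row] for row in img]


def expand(img):
    """Object expansion: a pixel becomes 1 if it or any neighbour is 1."""
    near = neighbour_or(img)
    return [[a | b for a, b in zip(r1, r2)] for r1, r2 in zip(img, near)]


def shrink(img):
    """Object shrinking, obtained by expanding the inverted image."""
    return invert(expand(invert(img)))


def edges(img):
    """Object edge: the object pixels that shrinking would remove."""
    core = shrink(img)
    return [[a & (1 - b) for a, b in zip(r1, r2)] for r1, r2 in zip(img, core)]


# A 3 x 3 object in the middle of a 5 x 5 array.
image = [[1 if 1 <= y <= 3 and 1 <= x <= 3 else 0 for x in range(5)]
         for y in range(5)]
```

Because the neighbour inputs are merged by OR before they reach the PE, none of these functions can tell which side a neighbour lies on.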

A computer simulation of the array was written together with a Monte Carlo program. 
This applied a wide range of random intra-processor connections (between the 
labelled points) with a view to discovering all differing image processing operations 
which could be implemented by the array. The otherwise exhaustive search was 
narrowed by eliminating obviously useless connections, such as connecting the
positive voltage supply (+) to Earth. Rerunning the program many times established 
the existence of more than 70 processing functions. For reasons that are not clear, 
those functions which had been built into the hardware array were discovered by the 
Monte Carlo program earliest in its operation. 

Because the connections between PEs were combined by OR-gates to provide a single 
input into neighbouring PEs, the array had no 'sense of direction'. For example, it 
would never be able to detect that one object lay above another in an image. Also, the 
obsolescent hardware components used to construct the array imposed undesirable 
constraints on the implementation of the logic functions. The next stage in the 
research programme utilised first small scale, next medium scale and finally large 
scale integration. 

The CLIP Project

Continuing the search for the optimum PE design, a series of array processors was 
constructed. These so-called Cellular Logic Image Processors (CLIP1 to CLIP4)
were gradually increased in complexity thus allowing each to be thoroughly
understood before additional sophistication was permitted. CLIP1 and CLIP2 will not
be described here as all their important features were included in CLIP3. CLIP3 will
also not be discussed in detail since its main purpose was to provide a design study
for a fully integrated version which could be manufactured and marketed. In fact, for
reasons of cost, CLIP4 was slightly less complex than CLIP3. The higher level of
integration in CLIP4 (8 PEs per integrated circuit) made it economic to build an array
of 96 x 96 PEs whereas CLIP3 had only 16 x 12 PEs and was not of practical value
for applied image processing.

The logic functions of CLIP4 are shown in outline in Figure 6. At the heart of the PE 
are two identical minterm generators. Each has two binary data inputs (A, the value 
of the local pixel, and a composite value derived from another pixel value stored in B 
and from data from neighbours), one binary output and four binary control inputs. By 
applying any of the sixteen possible 4-bit binary control words to a generator, any of 
the sixteen possible Boolean combinations of the two inputs can be produced at the 
output. The output from the lower generator is distributed to neighbouring PEs and 
the upper output is stored as a result. Each PE, on receiving inputs from neighbours, 
selects a subset by means of a programmable gate and ORs the subset with a single bit 
stored in local memory (B in the figure). Further gates allow the PE to act as a full 
adder. Additional connections are used to input and output data to and from the array. 
The detailed operation of the PE is too complex to describe in the space available for 
this review but a full description of the CLIP3/CLIP4 systems can be found in [3], [6]. 
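
The claim that a 4-bit control word selects any of the sixteen Boolean functions of two inputs can be checked with a short sketch. It is illustrative only: the assignment of control bits to minterms is an assumption made here and is not taken from the CLIP4 documentation, and `boolean_processor` and `truth_table` are hypothetical names.

```python
from itertools import product


def boolean_processor(control, a, b):
    """Evaluate the function selected by a 4-bit control word.

    Bit k of `control` enables the minterm for the input pair with
    k = 2*a + b (this bit ordering is an assumption of the sketch).
    """
    return (control >> (2 * a + b)) & 1


def truth_table(control):
    """Outputs for the input pairs (0,0), (0,1), (1,0), (1,1)."""
    return tuple(boolean_processor(control, a, b)
                 for a, b in product((0, 1), repeat=2))


# A few familiar functions expressed as control words.
AND, OR, XOR, NOT_A = 0b1000, 0b1110, 0b0110, 0b0011

# Every control word yields a different truth table.
tables = {truth_table(c) for c in range(16)}
```

Since each control bit switches one minterm on or off, the sixteen control words map one-to-one onto the sixteen possible truth tables.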




Fig. 6. Schematic logic diagram of the CLIP4 processing element

Three classes of operation can be performed by these CLIP processors. They are
those in which: 

• Each output pixel is a function only of the corresponding input pixel; 

• Each output pixel is a function of the corresponding input pixel and of the eight 
pixels surrounding it; 

• Each output pixel is a function of the corresponding input pixel and of any other 
connected to it by propagation through chains of neighbouring pixels. 

One further feature of the array is an OR-gate (not shown in the figure) with inputs 
from every PE, used to determine whether a binary image stored in the PEs has at 
least one pixel which is non-zero. In general, the PE processes single bit binary data 
in each operation; multiple bit data is processed one bit at a time, i.e., bit-serially. 
Although beyond the scope of this review, it can easily be shown that an array of PEs 
with the features listed here can be programmed to perform all image operations and, 
indeed, all mathematical calculations. In short, CLIP3 and CLIP4 are universal 
computing systems. 
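
Bit-serial processing with a full adder in every PE can be sketched as follows. This is a plain Python illustration, assuming images are stored as lists of bit planes with the least significant plane first; the helper names are hypothetical and nothing here reproduces the CLIP instruction set.

```python
def to_planes(img, bits):
    """Split an integer image into `bits` binary planes, least significant first."""
    return [[[(v >> k) & 1 for v in row] for row in img] for k in range(bits)]


def from_planes(planes):
    """Reassemble an integer image from its bit planes."""
    h, w = len(planes[0]), len(planes[0][0])
    return [[sum(p[y][x] << k for k, p in enumerate(planes)) for x in range(w)]
            for y in range(h)]


def bit_serial_add(a_planes, b_planes):
    """Add two images one bit plane at a time, every pixel in lock step."""
    h, w = len(a_planes[0]), len(a_planes[0][0])
    carry = [[0] * w for _ in range(h)]
    result = []
    for pa, pb in zip(a_planes, b_planes):
        # Full adder: sum bit, then carry bit, computed for all pixels.
        total = [[pa[y][x] ^ pb[y][x] ^ carry[y][x] for x in range(w)]
                 for y in range(h)]
        carry = [[(pa[y][x] & pb[y][x]) | (carry[y][x] & (pa[y][x] ^ pb[y][x]))
                  for x in range(w)] for y in range(h)]
        result.append(total)
    result.append(carry)  # the final carry becomes the top bit plane
    return result


def any_pixel_set(plane):
    """Global OR over the array: is at least one pixel non-zero?"""
    return any(v for row in plane for v in row)


a = [[3, 5], [7, 0]]
b = [[1, 2], [7, 6]]
total = from_planes(bit_serial_add(to_planes(a, 3), to_planes(b, 3)))
```

The number of steps grows with the word length rather than with the number of pixels, which is the trade-off a bit-serial array makes.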

The development of CLIP4 extended from 1974 to 1980. At that time, the CLIP4
integrated circuit was the largest ever to be manufactured in the UK under contract to
the universities and the technical difficulties experienced were immense. After this
worrying development period, CLIP4 was applied to many image processing projects
and was in constant use for the next 10 years. It was certainly, at the start, the largest 
working parallel processor array in the world and achieved the fastest real-time image 
processing at that time. 



7. Limitations of Image Processors 

Every dedicated image processing system has its limitations. Most embody as much 
parallel structure as is practicable but every design falls short in some way or another. 
Special purpose circuits providing a very restricted range of functions can only be of 
similarly restricted applicability, although some attempts have been made to build 
computers combining several special purpose circuits into one composite system. 
Their performance is not impressive since most of the units are idle for most of the 
time and the effective parallelism is low. 

The latency effect in pipeline processors together with the difficulty experienced in 
programming them in many applications has resulted in such systems falling into 
disuse. Processor arrays are also not easy to program although this is a skill which 
can be learned; there are no insurmountable difficulties in writing parallel forms of 
most image processing operations. 

A more serious problem is that processor arrays suffer from two related inefficiencies. 
The first is that, in general, moving images in and out of the array is a serial process 
and therefore slow. Secondly, moving data between extremes of the array (as, for 
example, is necessary when performing Fourier Transforms) involves stepping 
through chains of neighbouring PEs and is also very slow. Both these inefficiencies
can be lessened by adding more connection paths and this has been done in later 
systems, such as the Connection Machine [13]. A further problem is cost. Processor 
arrays are much less efficient when the size of the image array is larger than that of 
the processor array. Unless the level of integration can be made very high, the cost of 
constructing and assembling enough PEs to match images of television quality is too 
great for the majority of potential users. 

Research into parallel processing architectures for image processing has slowed down 
noticeably in recent years. On the one hand, the high cost of building these machines 
has made it difficult to obtain funding from the organisations which used to support 
this research. Equally, the long lead time for the production of new systems, taken 
together with the limited and uncertain market for the systems once they are 
produced, has discouraged manufacturing industry from continuing to be involved. 

However, possibly the major factor which slowed the pace of this field of research 
was the lessening of demand from the image processing community. The
widespread availability of high-powered workstations and the ever increasing
performance/cost ratio of PCs have meant that the priority for development of 
systems with higher speed has been displaced by a need for more effective algorithms 
in the majority of active areas in applied image processing. It is also the experience 
of many in the field that the image processing software packages which can be 
purchased tend to be disappointingly inflexible, especially when there is a need to 
incorporate new functions not contained in the original package. Consequently, 
development effort has been switched from hardware to software. 

A cynical comment on the state-of-the-art in image processing (or, perhaps more 
accurately, image analysis) would be that the computers now commercially available 
enable us to run bad programs adequately quickly and the use of even the best parallel 
processing methods would do nothing more than allow us to get poor results even 
faster. The same cannot be said about image generation, a wide-ranging subject 
embracing important and socially useful applications in the medical field as well as 
commercially profitable activities in computer games. In this area, there is always a 
demand for higher performance. 



8. Predictions for the Future 

Although the study of computer vision seems to be very unstructured and not 
progressing as well as had been optimistically expected three decades ago, there is 
still enough optimism amongst researchers to merit laying plans for the future when, 
it is believed, successful algorithms will have been developed and, once again, the 
need will be for faster processors. Enough is now understood about computer 
architecture to make it certain that adequately fast processing will only be achieved by 
the use of parallelism. At the same time, every attempt will have to be made to 
employ the fastest possible components. 

There are physical limitations to the extent to which integrated circuit devices can be
made faster. Current research is exploring these limitations by investigating
nanotechnology where circuit components are defined with a precision approaching
one nanometre (10^-9 metre). If devices can be made to work with such dimensions,
then it would be conceivable that a CLIP4 array of size 512 x 512, together with 
adequate amounts of memory local to each PE, could be formed on one integrated 
circuit slice. Furthermore, images could be input to the slice by projecting them onto 
photosensitive components located with each PE. The power of such a system would 
far exceed anything now in use and the cost, assuming the technology had been given 
time to 'mature', would be a mere fraction of that of today's supercomputers. 

Undoubtedly, there will be many major technical problems to solve. At this scale, 
long connections between parts of the array are difficult to fabricate. In particular, the 
distribution of control instructions synchronously across the array will be hard to 
achieve. Potential failure of devices embedded in the array will have to be combated 
by the liberal use of redundancy. 

There are some indications that it may be hard to define and control the characteristics
of the millions of devices in these giant arrays. If this is true, then a new style of 
programming might be necessary in which variability is not only accepted but also 
exploited. A Monte Carlo program running on a conventional computer gains its 
power to solve problems by introducing random numbers into what would otherwise 
be a completely predictable performance; could it be that a similar broadening of 
capability might be obtained by randomising the values of some of the device 
parameters in the processor arrays? 

There is a philosophical point to be made here. We have always looked to human 
vision as a sort of role model for computer vision system designers but this may have 
been unwise. Human vision is there to enable humans to survive in their 
environment, not to equip humans with a precise optical measuring system. In 
everyday life, a broad, comprehensive view of the world is all that is needed and the 
speed at which this must be obtained is only of the order of human reaction time, i.e., 
an analysis in a few tens of milliseconds. 

On the other hand, computer vision has generally been used to make fast and accurate 
measurements in a very constrained environment. This may imply that at least two 
very different types of image processing computer will be needed: one in which speed
and/or accuracy are the dominating goals and the other in which speed need not be of
the highest but robustness in an unconstrained environment will be of fundamental
importance. 

Nevertheless, the ideas behind parallel processing are justified both by the
physiological example from which they sprang and by the fact that they were found to be
effective when applied to computer architecture. It is difficult to conclude that
tomorrow's computers will revert to a serial architecture. Parallelism is definitely 
here to stay. 


Abstract. The most demanding image processing applications require
real time processing, often using special purpose hardware. The work
presented here refers to the application of cluster computing for off
line image processing, where the end user benefits from the operation of
otherwise idle processors in the local LAN. The virtual parallel computer
is composed of off-the-shelf personal computers connected by a low cost
network, such as a 10 Mbit/s Ethernet. The aim is to minimise the
processing time of a high level image processing package. The system 
developed to manage the parallel execution is described and some results 
obtained for the parallelisation of high level image processing algorithms 
are discussed, namely for active contour and modal analysis methods 
which require the computation of the eigenvectors of a symmetric matrix. 



1 Introduction 

Image processing applications can be very computationally demanding due to 
the large amount of data to process, to the response time required, or to the 
complexity of the image processing algorithms. A wide range of general purpose 
or custom hardware has been used for image processing. SIMD computers, using 
data parallelism, are suitable for low level image analysis, where each processor 
performs a uniform set of operations based on the image data matrix in a fixed 
amount of time; in [28] a special purpose SIMD computer with 1024 processors
was presented. Systolic Arrays [11], which can exploit the regular and
constant-time operations of an algorithm, are also an option.

MIMD computers, commonly used in simulation, are suitable for high level 
image processing, such as pattern recognition, where each processor is assigned 
an independent operation [3]. For real time vision applications special MIMD
computers were developed, e.g. ASSET-2, based on PowerPC processors for
computation and on Transputers for communication [29]. MIMD computers were
characterised by exploiting a variety of structures; however, technological
factors have been forcing a convergence towards systems formed by a collection of
essentially complete computers connected by a communications network [9]. The
processors in these computers are the same ones used in current workstations.
Therefore, the idea of forming a parallel computer from a collection of
off-the-shelf computers comes naturally, and fast communication techniques were also
developed for that purpose [25]. Several cluster computing systems have been
developed, e.g. the NOW project [2].

Our aim is not to build a specific cluster of personal computers for parallel
image processing, but rather to perform parallel processing on already existing
group clusters, where each node is a desktop computer running the Windows
operating system. These clusters are characterised by having a low cost
interconnection network, such as a 10/100 Mbit/s Ethernet, connecting different
types of processors, of variable processing capacity and amount of memory, thus
forming a heterogeneous parallel virtual computer. Due to network restrictions,
which do not allow simultaneous communication among several nodes, the
application domain is restricted to about one or two dozen processors.

The motivation for a parallel implementation of image algorithms comes from 
image and image sequence analysis needs posed by various application domains, 
which are becoming increasingly more demanding in terms of the detail and 
variety of the expected analytic results, requiring the use of more sophisticated 
image and object models (e.g., physically-based deformable models), and of more 
complex algorithms, while the timing constraints are kept very stringent. 

A promising approach to deal with the above requirements consists in
developing parallel software to be executed, in a distributed manner, by the machines
available in an existing computer network, taking advantage of the well-known
fact that many of the computers are often idle for long periods of time [20].
It is quite common in many organisations that a standard network connects
several general purpose workstations and personal computers, accumulating a
very substantial computing power that, through the use of appropriate
managing software, could be put at the service of the more computationally demanding
applications.

Existing software, such as the Windows Parallel Virtual Machine (WPVM)
[1], allows building parallel virtual computers by integrating in a common
processing environment a set of distinct machines (nodes) connected to the network.
Although the parallel virtual computer nodes and the underlying communication
network were not designed for optimised parallel operation, very significant
performance gains can be attained if the parallel application software is conceived
for that specific environment.



2 Image Algorithms and Systems 

The image algorithms that have been parallelised consist of a set of low level
image processing operations, namely edge detection [27,6], distance transform,
convolution mask, histogramming and thresholding, whose suitability to the
cluster architecture was analysed in [4]. A set of linear algebra algorithms which
are building blocks for many high level image processing methods was also
implemented. These algorithms are the matrix product [14], LU factorisation [7],
tridiagonal reduction [8], symmetric QR iteration [15], matrix inversion [23] and
matrix correlation.

In this paper, the results presented focus on high level image processing 
algorithms, namely active contours [19] and modal analysis [26]. 

Some image processing systems have been proposed to run on a cluster of 
personal computers. In [17] two highly demanding vision algorithms were tested 
giving superlinear speedup, due to memory paging on a single workstation.
The machines formed a homogeneous computer. In [18] a high level interface
parallel image processing library is presented and results are reported for low 
level image operations on an Ethernet network of HP9000/715 workstations 
and on an ATM network of SGI workstations. In [21] a machine independent 
methodology was proposed for homogeneous computers; results were presented 
separately for two SMP workstations with two and eight processors, not requiring 
communication between machines. 

Our implementation differs from the ones mentioned above as it considers 
a general bus type heterogeneous cluster where data is distributed in order to 
obtain a correct load balancing, and also because the number of processors that
participate in a distributed algorithm varies dynamically in order to minimise the
processing time of each operation [5].

3 System Architecture 

The computers that belong to the virtual machine run a process to monitor 
the percentage of processor time spent with the local user. Conceptually, local 
users have priority over the distributed application and the computer will not 
be available if the mean local user time is above a minimum threshold during a 
specified period of time, e.g. 5 seconds. 

Each algorithm or task is decomposed until indivisible operations are
obtained for which parallel code exists. When a parallel algorithm is launched the
master process schedules work to the processors of the virtual machine according
to their availability, choosing the number of processors that minimises the
processing time of individual operations, and allowing data redistribution if the optimal
grid [4] of processors changes from operation to operation.

As an example, the algorithm to extract the contour of an object can be
decomposed into edge enhancement, thresholding and contour tracking operations.

3.1 Hardware Organisation and Computational Model 

The hardware organisation is shown in figure 1. Each node of the virtual
machine is a personal computer under the Windows NT operating system, running
WPVM software to communicate. The interconnection network is an Ethernet
at 10/100 Mbit/s.

Several computational models [9,30,16] were proposed in order to estimate 
the processing time of a parallel program in a distributed memory machine. 

Fig. 1. Hardware organisation



Although they could be adapted for the cluster of personal computers, a specific 
and simplified model is presented below. 

The total processing time is obtained by summing the time spent in
communications ($T_C$) and the parallel processing time ($T_P$). Each node of the machine is
characterised by the processor capacity $S_i$, measured in Mflop/s. The network
is characterised by allowing only one message to be broadcast at a given time,
by the latency time ($T_L$) and by the bandwidth ($L_B$). The time to send a message
($T_C$) composed of $nb$ bytes is given by:

$$T_C = K\,T_L + \frac{nb}{L_B}, \qquad K = \left\lceil \frac{nb}{packetsize} \right\rceil \qquad (1)$$

The value $K$ multiplies $T_L$ because each message is partitioned into packets of
length 46 to 1500 bytes ($packetsize$), with a latency time incurred for each packet;
1024 is a typical packet size.
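
The communication model can be turned into a small estimator. The sketch below reads the model as a per-packet latency multiplied by the number of packets, plus the message size divided by the bandwidth; that reading, the choice of bytes per second as the bandwidth unit, and the function name `comm_time` are assumptions made for illustration.

```python
import math


def comm_time(nb, latency, bandwidth, packet_size=1024):
    """Estimated time in seconds to send a message of `nb` bytes.

    latency   -- per-packet latency in seconds
    bandwidth -- network bandwidth in bytes per second (assumed unit)
    """
    packets = math.ceil(nb / packet_size)   # K, the number of packets
    return packets * latency + nb / bandwidth


# A 64 KB image over a 10 Mbit/s link (1.25e6 bytes/s); the 0.5 ms
# latency is a made-up value used only to exercise the function.
estimate = comm_time(64 * 1024, latency=0.0005, bandwidth=1.25e6)
```

Because the latency term scales with the packet count, many small messages cost noticeably more than one large message of the same total size.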

The parallel component $T_P$ of the computational model represents the
operations that can be divided over a set of $p$ processors obtaining a speedup of $p$,
i.e. operations without any sequential part:

$$T_P(n, p) = \frac{\psi(n)}{\sum_{i=1}^{p} S_i} \qquad (2)$$

The numerator $\psi(n)$ is the cost function of the algorithm measured in floating
point operations (flop) as a function of the problem size $n$. For example, to
multiply square matrices of size $n$, the cost is $\psi(n) = 2n^3$ [10].
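
As a worked illustration, the parallel component can be evaluated for a heterogeneous machine. The sketch assumes the model divides the cost function by the summed capacities of the participating processors, with capacities given in Mflop/s; the function names are hypothetical and the capacity list is the machine quoted in the caption of Fig. 6.

```python
def matmul_flops(n):
    """Cost function for the product of two n x n matrices."""
    return 2 * n ** 3


def parallel_time(flops, capacities_mflops):
    """Ideal processing time in seconds for processors of the given capacities."""
    return flops / (sum(capacities_mflops) * 1e6)


machine = [244, 244, 161, 161, 60, 50, 49]   # Mflop/s per processor
t_all = parallel_time(matmul_flops(1800), machine)
t_fastest = parallel_time(matmul_flops(1800), machine[:1])
```

Comparing `t_all` with `t_fastest` gives the best speedup this term alone could offer; the communication time still has to be added before choosing how many processors to use.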

3.2 Software Organisation 

Each operation is represented by an object containing the parallel and serial
implementation of the code, since the system can schedule a sequential
execution remotely if it is advantageous. The object associated with the operation also
contains information on the computational complexity and the amount of data
that must be exchanged in order to complete the operation. Based on these
parameters the system determines the number and the identities of the intervening
processors, in order to minimise the operation processing time [4].

Each data instance to be processed, either an image or a matrix, is
represented by an object responsible for accessing data items correctly according to
the data distribution information.

Data distribution is represented by independent objects with functions to 
locate any item of data and to translate global to local indexes and vice-versa. 
Each object can be shared by more than one data instance. Figure 2 shows the
software organisation. 
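
The translation between global and local indexes can be illustrated with the standard one-dimensional block cyclic layout. This is a sketch of the homogeneous textbook scheme only; the heterogeneous variant used by the system is described in [5] and is not reproduced here, and the function names are hypothetical.

```python
def global_to_local(g, block, nprocs):
    """Map a global index to (owner process, local index)."""
    blk = g // block                         # global block number
    owner = blk % nprocs                     # blocks are dealt out round-robin
    local = (blk // nprocs) * block + g % block
    return owner, local


def local_to_global(owner, local, block, nprocs):
    """Inverse mapping: recover the global index held by a process."""
    local_blk = local // block
    return (local_blk * nprocs + owner) * block + local % block


# Item 13 with blocks of 4 items on 3 processes.
owner, local = global_to_local(13, block=4, nprocs=3)
```

Keeping both directions in one object, as the text describes, lets several data instances share the same layout without duplicating the index arithmetic.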




Fig. 2. Software organisation 



The user describes a macro of sequential operations to be executed, referring to
the data instances to be processed. The system executes each operation in
parallel, determining for each one the number of processors to be used in order to
minimise the processing time. The data distribution suitable for each operation
is coded in the operation code.



Input i1 image1.bmp
Shencastan i1 i2 i3 0
Histogram i2 outfile.txt i4 
Output i2 
Output i4 

Fig. 3. Macro describing the operations to be executed 



Figure 3 shows an example of a macro. The input file i1 is subject to an edge
detector [27]; the operator outputs, the gradient's magnitude and direction, are
stored in i2 and i3 respectively. The histogram is then computed and displayed
as an image, its data being also saved in a text file.

3.3 Data Distribution and Load Balancing 

Different strategies are applied to images and matrices. Images are partitioned 
in blocks of contiguous rows or columns and the blocks are assigned to each 
process [4]. This distribution is suitable for data independent image operators. 
Matrices are organised in square blocks of data and a novel version [5] of the 
block cyclic domain distribution [13], adapted to a heterogeneous environment,
is used for assigning them to the processor grid. 

A balanced distribution is achieved by a static load distribution made prior to 
the execution of the parallel operation. To achieve a balanced distribution in the 
heterogeneous machine the relative amount of data assigned to each processor,
$l_i$, is a function of its processing capacity compared to the entire machine:

$$l_i = \frac{S_i}{\sum_{j=1}^{p} S_j}$$

For matrices, due to block indivisibility, it is not always possible to ensure an 
optimal load balancing; however, the scheduler computes the optimal solution 
for a given network [5]. The processor placement on the virtual grid is also done
in order to achieve a balanced distribution. 
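
A static, capacity-proportional split of image rows can be sketched as follows. It assumes each processor's share is its capacity divided by the total capacity of the machine, and it hands leftover rows to the largest fractional remainders; that rounding rule and the name `partition_rows` are illustrative choices, not the scheduler described in [5].

```python
def partition_rows(n_rows, capacities):
    """Split n_rows among processors in proportion to their capacities."""
    total = sum(capacities)
    counts = [n_rows * s // total for s in capacities]       # whole rows
    remainders = [n_rows * s % total for s in capacities]    # fractional part
    leftover = n_rows - sum(counts)
    # Give the remaining rows to the processors that were rounded down most.
    order = sorted(range(len(capacities)), key=lambda i: remainders[i],
                   reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts


machine = [244, 244, 161, 161, 60, 50, 49]   # Mflop/s per processor
rows = partition_rows(256, machine)
```

Rows are indivisible, just as matrix blocks are, so the result is only approximately proportional when the image is small relative to the capacity spread.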

4 Parallel Implementation of the Active Contour 
Algorithm 

An active contour is defined as an energy minimising curve subject to the action 
of internal forces and influenced by external image forces which move the contour 
to the relevant features in the image, such as lines and edges [19]. 

Active contours can be used for a variety of feature extraction operations in 
images, such as detection of lines and edges, detection of subjective contours, 
tracking analysis in a sequence of images or correspondence analysis in stereo 
images. 




Fig. 4. Application of the active contour algorithm in an angiocardiographic
image. Panels, left to right: detected edges, distance transform, contour detection.



Figure 4 (rightmost image) shows the contour detection over the original 
image of 64 KB. From an initial position (arbitrarily or interactively defined) 
and by using an iterative process, the contour moves in order to minimise its 
energy. The final position corresponds to a local minimum of the energy function 
defined. In this position, all the forces applied to the contour are mutually
cancelled, so that the contour does not move. The energy function was computed
based on the edge detection map (leftmost image) and the distance transform 
map (middle image). The quality of the detection depends on these two images. 
Different energy functions can be used [24]; however, not all are suitable for every 
application. 

The active contour points which are distant from the edges are pushed in
their direction by the distance transform. The points close to edges are mostly 
influenced by the edge map energy which locally refines the detection. 




Fig. 5. Active contour algorithm decomposed in indivisible tasks 



Figure 5 shows the tasks required to apply the active contour algorithm. 
The computation methodology is to sequentially execute each parallelised task, 
choosing the grid of processors that minimises the individual processing time 
and, consequently, the overall time. 

The image operators have been discussed in another paper [4]. Therefore, 
only the parallelisation of the LU factorisation routine is considered here. 



4.1 LU Factorisation Algorithm 

The LU factorisation algorithm is applied in order to solve directly the sys- 
tem of equations resulting from the active contour internal forces: elasticity and 
flexibility. The implementation follows the right-looking variant of the algorithm 
proposed in [12]. However, adaptations were made at the load distribution level
in order to obtain a balanced load for heterogeneous machines. Figure 6 (left) 
shows the load distribution obtained for a heterogeneous virtual machine. 




Fig. 6. LU and QR load distribution for a matrix size of 1800 and 1600 respectively,
for the machine M={244, 244, 161, 161, 60, 50, 49} Mflop/s processors
For processor grids (1,4) and (1,5) a very good load balance is achieved.
For the other grids the three slower processors took approximately 15% less
time than the other ones, due to the block indivisibility. The algorithm requires
a significant number of communication points, which limits its scalability,
as shown in figure 7 (left).
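The effect of block indivisibility on a heterogeneous machine can be illustrated with a minimal sketch. It assumes equal-cost blocks and a greedy assignment rule; both are simplifications chosen for illustration and are not the distribution scheme of the paper. The function name `distribute_blocks` and the block count of 60 are hypothetical.

```python
def distribute_blocks(n_blocks, speeds):
    """Greedy assignment of equal-cost, indivisible blocks (illustrative only).

    Each block goes to the processor that would finish earliest after
    receiving it, so block counts end up roughly proportional to speed.
    """
    counts = [0] * len(speeds)
    for _ in range(n_blocks):
        best = min(range(len(speeds)),
                   key=lambda p: (counts[p] + 1) / speeds[p])
        counts[best] += 1
    return counts


speeds = [244, 244, 161, 161, 60, 50, 49]   # Mflop/s, machine of Fig. 6
counts = distribute_blocks(60, speeds)       # 60 blocks: an assumed value
finish = [c / s for c, s in zip(counts, speeds)]  # relative finishing times
for s, c, t in zip(speeds, counts, finish):
    print(f"{s:4d} Mflop/s: {c:2d} blocks, relative time {t:.3f}")
```

Because whole blocks cannot be split, the relative finishing times printed by the sketch are not identical, which is the kind of residual imbalance described above.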




[Figure 7 panels: LU, TRD and QR (left); Matrix Correlation (right); horizontal axis: Processors]

Fig. 7. Isogranularity curves for a 6 processor homogeneous machine connected
by a 10 Mbit/s Ethernet; 160K elements for TRD, LU and QR and 250K for
LU2



The scalability analysis was carried out on a homogeneous machine in order to
reduce the influence of load imbalance.

5 Parallel Implementation of the Modal Matching 
Algorithm 

This high level image processing algorithm [26] is applied for the tracking of 
deformable objects along a sequence of images. Figure 8 shows the applica- 
tion of the algorithm. It is based on finite element analysis and it requires the 
computation of the eigenvectors of symmetric matrices. The aim is to obtain 
correspondences between object points of image i and i + n. The algorithm is 
divided into eigenvector computation and matrix correlation. The eigenvector 
computation is subdivided into three operations: tridiagonalisation, computation
of the corresponding orthogonal matrix, and QR iteration. The parallelisation is then realised
by the individual parallelisation of each operation. Data is redistributed if the 
processor grid changes between operations. 

5.1 Tridiagonal Reduction and Orthogonal Matrix Computation 

Tridiagonal reduction is the first algorithm applied to the symmetric matrix in 
order to obtain the eigenvectors. The algorithm output is a tridiagonal matrix 
T so that: 



    A = Q T Q^T    (4)
[Figure 8 panels: Instant i; Instant i + 2; Matching]



Fig. 8. Application of the modal analysis algorithm to a sequence of heart con- 
tours 



The matrix T replaces A in memory. As shown in figure 9 the best grid is a 
row of processors. Details of the algorithm can be found in [8]. 

The matrix elements of T, apart from the tridiagonal positions, store the data 
required for the second step of the eigenvector algorithm, i.e. the computation 
of Q. 

If the order of computation of the tridiagonal reduction were followed, an
O(n^4) algorithm would be obtained, corresponding to a matrix by matrix
product in each step, with n - 2 steps for a matrix of size (n, n). However, the
computation can be organised efficiently, as described in [22] for a sequential
algorithm, yielding a scalable operation for the virtual machine. Figure 9 shows
that the best grid is a row of processors.



5.2 Symmetric QR Iteration 

The QR iteration is the last operation for the eigenvector computation. The aim
is to obtain from the tridiagonal matrix T a diagonal matrix \Lambda whose
elements are the eigenvalues of A:

    T = G^T \Lambda G    (5)

The matrix G is then used to compute the eigenvectors Q' of A:

    Q' = Q G^T    (6)

The diagonal matrix is obtained by iterating on T and updating it with Givens
rotations [15]. To obtain Q' a matrix by matrix product would be required. However,
the operations can be organised in order to update Q' in each iteration, avoiding
the last matrix product. Each rotation updates only two columns of Q'. Based
on this fact a scalable operation was implemented by allowing the redistribution
of data. The optimal data distribution is by blocks of rows, so that any given
row is completely allocated to a given processor, avoiding communications between
processors for the update of Q'. The parallelisation implemented keeps
the O(n) chase operation in one processor, which computes all rotations for an
iteration and distributes them over a column of processors. Then all processors
update their rows, the O(n^2) part, in parallel and without communications. This
strategy has a large impact on the scalability of the QR iteration, as shown by
the isogranularity curve in figure 7. A good load balance is also achieved for a
heterogeneous machine, as shown in figure 6.
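The reason a block-of-rows distribution needs no communication can be seen in a small sketch. It assumes a plain list-of-lists matrix and one common sign convention for the Givens rotation; the helper names `rotate_columns` and `split_rows` are hypothetical and do not come from the paper.

```python
import math


def rotate_columns(rows, k, c, s):
    """Apply a Givens rotation to columns k and k+1 of the given rows, in place."""
    for row in rows:
        a, b = row[k], row[k + 1]
        row[k] = c * a - s * b
        row[k + 1] = s * a + c * b


def split_rows(matrix, n_procs):
    """Block-of-rows distribution: each (simulated) processor owns whole rows."""
    size = -(-len(matrix) // n_procs)          # ceiling division
    return [matrix[p * size:(p + 1) * size] for p in range(n_procs)]


n = 6
Q = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
c, s = math.cos(0.3), math.sin(0.3)            # rotation coefficients, sent to all

reference = [row[:] for row in Q]
rotate_columns(reference, 2, c, s)             # update done by a single processor

for block in split_rows(Q, 3):                 # each block updated independently
    rotate_columns(block, 2, c, s)

print(Q == reference)
```

Each row is updated using only its own two entries and the pair (c, s), so once the rotation coefficients are known every processor can update its rows in isolation.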

The ideal grid for the QR iteration is the opposite (column vs. row) of the 
ones for tridiagonal and orthogonal matrix computation. This is the reason for 
considering indivisible operations and allowing redistribution of data between 
them to adapt the parallel machine to each operation. 

5.3 Matrix Correlation 

After the QR iteration has been computed for the objects in both images the 
eigenvectors are ordered in decreasing order of magnitude of the corresponding
eigenvalue. The correlation operation measures the similarity between the eigen- 
vectors of both objects. The behaviour of the processing time function shown in 
Figure 9 is different from the other operations. The best grid is either a row or 
a column of processors. The parallel algorithm is scalable as shown in figure 7. 




[Figure 9 panels: Tridiagonal reduction; Orthogonal matrix; Matrix correlation]



Fig. 9. Estimated processing time for a 6 processor homogeneous machine con- 
nected by a 10 Mbit/s Ethernet 



6 Results 

Results are presented for machine M1, composed of 6 homogeneous processors of
UlM flop/s each, M2={244, 244, 161, 161, 60, 50, 49} Mflop/s and M3={161,
161, 112, 80} Mflop/s processors. M1 is connected by a 10 Mbit/s Ethernet, and
M2 and M3 by a 100 Mbit/s one. The performance metrics used to evaluate the
parallel application are, first, the runtime and, second, the speedup achieved. For
a fair comparison in terms of speedup, one defines the Equivalent Machine
Number EMN(p), which considers the power available instead of the number of
machines, the latter being ambiguous for a heterogeneous environment.
Equation 7 defines EMN(p) and the heterogeneous efficiency E_H for p processors
used, where S_1 is the computational capacity of the processor that executed the
serial code, also called the master processor.
    EMN(p) = \frac{\sum_{i=1}^{p} S_i}{S_1} ,    E_H = \frac{Speedup}{EMN(p)}    (7)



For the machine M3, EMN(4) = 3.19, i.e. using the 4 processors of the heterogeneous
machine is equivalent to using 3.19 processors identical to the master processor,
if this is a 161 Mflop/s one.
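The EMN(4) value quoted for M3 can be checked numerically. The sketch below assumes that EMN(p) is the sum of the capacities of the p processors used divided by the capacity of the master processor, which reproduces the figure of 3.19; the function names are hypothetical.

```python
def emn(speeds, master_speed):
    """Equivalent Machine Number: total capacity in units of the master processor."""
    return sum(speeds) / master_speed


def heterogeneous_efficiency(speedup, speeds, master_speed):
    """Speedup divided by the Equivalent Machine Number."""
    return speedup / emn(speeds, master_speed)


m3 = [161, 161, 112, 80]            # Mflop/s, machine M3
print(round(emn(m3, 161), 2))       # 514 / 161, rounded to two decimals
```

For a homogeneous machine the same definition reduces to the number of processors, so E_H coincides with the usual parallel efficiency.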




[Figure 10 panels: Skin tumor detection (left picture); Active contour results (right table)]



Fig. 10. Application results of the active contour algorithm 



The right table of figure 10 presents results for the parallel active contour
algorithm executed on the M3 machine for an image of 64 KB (figure 4) and for a
256 KB one (the left picture in figure 10). The time T_1 represents the processing
time of the serial code on the master processor and T_VM the parallel processing
time on the virtual machine. The number of processors selected in each step of
the algorithm changes in order to minimise the processing time.
Fig. 11. Eigenvector computation in the M2 machine



Results for the eigenvector computation are presented in figure 11 for machine
M2 due to the wide application of the algorithm. As shown, the heterogeneous
efficiency is near 80% for matrices with more than 1400^2 elements. However, the
first metric is processing time, which is reduced for matrices larger than 400^2
elements.
Fig. 12. Modal analysis in the homogeneous machine M1



To show the improvement due to the dynamic management of the parallel
processing system, results for the modal analysis algorithm are presented for the
homogeneous machine M1 in figure 12. The left chart compares the computation
time of the virtual machine VM when the optimal number of processors is selected,
as indicated in the processing results table, against the processing time
when the same number of processors is used for all stages of the algorithm. In
the latter case the minimum time is obtained with 4 processors; however, the
total time is higher than the time obtained with VM.



7 Conclusions 

An operation-based parallel image processing system for a cluster of personal
computers was presented. The main objective is that the user of a computationally
demanding application may benefit from the computational power distributed
over the network, while keeping other active users undisturbed.

This goal can be achieved in a transparent manner for the user, once the 
modules of his/her application are correctly parallelised for the target network 
and the performance of the machines in the network is known. The applica- 
tion, before initiating a parallel module, determines the best available computer 
composition for a parallel virtual computer to execute it, and then launches the 
module, achieving the best response time possible in the actual network condi- 
tions. 

Practical tests were conducted on both homogeneous and heterogeneous networks.
In both cases the theoretically optimal computer grid was confirmed by
the measured performance. A balanced load was achieved in both machines. The 
machine scalability depends essentially on the communication requirements of 
the operations. For QR iteration and matrix correlation the system is scalable; 
however, it is not so for the tridiagonal reduction. 
Other generic modules will be parallelised and tested, so that an ever increasing
number of image analysis methods may be assembled from them. Application
domains other than image analysis may also benefit from the proposed
methodology.
Abstract. A method for mapping nonlocal image processing algorithms
onto Orthogonal Multiprocessor Systems (OMP) is presented in this paper.
Information is moved between memory modules by alternating the
processors' mode of accessing the memory array. The introduction of
synchronisation barriers when the processors attempt to change the memory
access mode of the OMP architecture synchronises the processing.
Two parallel and nonlocal algorithms of the low and intermediate image
processing levels are proposed and their efficiency is analysed. The
performance of the OMP system for this type of algorithm was evaluated
by simulation.



1 Introduction 

The OMP architecture can be classified as a parallel shared memory architecture
[1, 2]. The most important characteristics of an n-processor OMP architecture
are (see Fig. 1a): i) the memory is divided into n^2 memory modules organised as a
two-dimensional array; ii) the processors and the memory modules are interconnected
by non-shared buses; iii) processors are allowed to concurrently access
distinct rows or columns of the memory array.

The OMP architecture has two different mutually exclusive modes of operation,
which correspond to different ways of accessing the memory: row access
mode and column access mode. With the architecture in row access mode, any
processor P_i has direct access to row i of the array of memory modules
(M_{i,j}, 0 <= j < n); with the architecture in column access mode, any processor
P_i has direct access to column i of the array of memory modules
(M_{j,i}, 0 <= j < n). Therefore, the system buses are not shared and the memory
access is free of conflicts in each one of the access modes.

The communication between any pair of processors of an OMP architecture
can take place on two different memory modules: e.g. the pair of processors
(P_i, P_j) can communicate through the memory modules (M_{i,j}, M_{j,i}). On the
other hand, modules located on the main diagonal of the memory matrix are
only accessed by a single processor: e.g. M_{i,i} is only accessed by P_i.
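The two access modes and the pair of modules shared by two processors can be written down directly. The following is a minimal sketch with hypothetical function names, representing module M_{r,c} as the tuple (r, c).

```python
def accessible_modules(i, n, mode):
    """Modules that processor P_i can reach directly in the given access mode."""
    if mode == "row":
        return [(i, j) for j in range(n)]      # row i: M_{i,j}
    if mode == "column":
        return [(j, i) for j in range(n)]      # column i: M_{j,i}
    raise ValueError("mode must be 'row' or 'column'")


def shared_modules(i, j):
    """Modules through which P_i and P_j can communicate."""
    return {(i, j), (j, i)}


n = 4
print(accessible_modules(1, n, "row"))
print(accessible_modules(1, n, "column"))
print(shared_modules(1, 3))
```

In either mode the sets returned for different processors are disjoint, which is why accesses are conflict free, and `shared_modules(i, i)` collapses to the single diagonal module.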

Spatial (image) parallelism is often found in low and intermediate level image 
processing operators [3]. Unlike task parallelism, in image parallelism the algo- 
rithm is applied in parallel to separated parts of the original image. Generally, 
(a) (b)

Fig. 1. OMP architecture with 4 processors: a) logical diagram; b) distribution 
of the image by the memory modules. 

OMP systems have a rather low number of processors relative to the number
of pixels of an image. The image is distributed among the processors of an OMP
architecture by mapping the array of pixels onto the array of memory modules
(see Fig. 1b): an N x N image is divided into n^2 sub-images, each with k^2 = (N/n)^2
neighbouring pixels, and sub-images with neighbouring pixels are placed in adjacent
memory modules.
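The distribution of pixels among memory modules amounts to an integer division. A short sketch, assuming that N is a multiple of n and using the hypothetical name `module_of_pixel`:

```python
def module_of_pixel(i, j, N, n):
    """Return ((module row, module column), (local row, local column)) for pixel (i, j)."""
    k = N // n                      # side of each sub-image, assuming n divides N
    return (i // k, j // k), (i % k, j % k)


# 512 x 512 image on a 4-processor OMP system: 16 modules of 128 x 128 pixels
print(module_of_pixel(130, 5, N=512, n=4))
```

Neighbouring pixels fall in the same module or in an adjacent one, which is the property that local operators rely on.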

This paper presents the design and analysis of nonlocal image processing parallel
algorithms for OMP systems. The parallel algorithms considered presuppose
the above mentioned distribution of the image among the memory modules.
This paper begins by presenting a general method for mapping nonlocal image 
processing on OMP systems (Section 2). Afterwards some parallel algorithms for 
typical image processing tasks are proposed (Section 3). In Section 4 the per- 
formance of an OMP image processing system is evaluated using the simulation 
results. Finally Section 5 is devoted to the conclusions. 

2 Nonlocal Image Processing on OMP Systems 

There is a much more demanding interaction between processors in nonlocal than 
in local image processing tasks. The rules that govern the information transfers
can be unknown a priori, because they depend on the processing parameters 
and/or on the image content. 

Communication oriented models of parallel processing explicitly include com- 
munication aspects of the processing. The parallel processing results from an 
alternate sequence of stages: a computation stage followed by a communication 
stage [4]. In a computation stage, the processing is carried out independently by 
the multiple processors. In the communication stages, instead, nonlocal informa- 
tion, required for the next processing stage, is exchanged among the processors. 

With the image parallelism approach, parallel processing stages consist of
the application of similar image processing operators to different parts of the
image. The processors of an OMP system have direct access to the image pixels
stored on row and column modules of a two-dimensional memory array.












(a) (b) (c) 

Fig. 2. Data transfer procedure for nonlocal image processing on OMP systems. 



Nonlocal image processing often requires that random transfers of data be
carried out at the communication stages. As mentioned in Section 1, processors of
an OMP system may exchange data through pairs of exclusive memory modules.
Therefore, if the processor P_i needs to access data stored on the memory module
M_{j,k}, it has to establish communication with processor P_j or P_k. With the OMP
architecture in the column access mode, data can be moved using the following
procedure (the procedure is illustrated in Fig. 2 for P_0): step a) P_i
places the address of the data, namely the index of array memory row and the
relative position inside the module, on the memory M_{j,i}; step b) the memory access
mode is changed and P_j accesses the data address on M_{j,i} and moves the data
from M_{j,k} to M_{j,i}; step c) the architecture is put back in the initial memory
access mode and processor P_i directly accesses the claimed data on the M_{j,i}
memory.
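The three steps can be traced with a small simulation. This is a sketch under simplifying assumptions: memory modules are Python lists held in a dictionary keyed by (row, column), a single request is served, and the names `can_access` and `fetch` are hypothetical.

```python
def can_access(proc, module, mode):
    """True if processor `proc` may touch `module` = (row, col) in `mode`."""
    row, col = module
    return proc == row if mode == "row" else proc == col


def fetch(memory, i, j, k, offset):
    """P_i obtains memory[(j, k)][offset] through the three-step procedure."""
    mailbox = {}
    # step a) column mode: P_i leaves the request in M_{j,i}
    assert can_access(i, (j, i), "column")
    mailbox[(j, i)] = (k, offset)
    # step b) row mode: P_j reads the request and copies the data M_{j,k} -> M_{j,i}
    assert can_access(j, (j, i), "row") and can_access(j, (j, k), "row")
    src_col, src_off = mailbox[(j, i)]
    mailbox[(j, i)] = memory[(j, src_col)][src_off]
    # step c) column mode again: P_i reads the data from M_{j,i}
    assert can_access(i, (j, i), "column")
    return mailbox[(j, i)]


n = 4
memory = {(r, c): [f"pixel({r},{c})[{o}]" for o in range(8)]
          for r in range(n) for c in range(n)}
print(fetch(memory, i=0, j=1, k=2, offset=3))
```

The assertions make explicit that P_i never touches M_{j,k} itself: every access in the sketch is legal for the processor and access mode in which it occurs.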

This procedure can be extended in order to transfer data and partial results 
concurrently among any processors of an OMP system. At the end of a processing 
stage, the addresses of the data to be transferred are placed in the memory 
modules according to the rules presented above. Processors request a change of 
the memory access mode to signal the transition from a processing stage to a 
communication stage. These requests are used to implement a synchronisation 
barrier: each processor reaching the barrier waits until all other processors have 
also reached the barrier. A communication stage begins by changing the memory 
access mode of the architecture. Then, each processor looks for the addresses of 
the data that have to be moved on all memory modules of a row (see Fig. 2b) 
and undertakes the task. In spite of the data being concurrently moved by the 
processors, there is no guarantee about a fair work distribution between the 
processors, since the source and destination addresses are unknown a priori. The
exit of a communication stage is made by implementing another synchronisation 
barrier, based on the requests for changing the memory access mode. 
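The alternation of processing and communication stages separated by barriers can be mimicked in software. The sketch below uses threads and `threading.Barrier` from the Python standard library as a stand-in for the hardware mode-change requests; it is an analogy, not a model of the OMP control logic, and `run_stages` is a hypothetical name.

```python
import threading


def run_stages(n):
    """Run n workers through compute -> barrier -> communicate -> barrier."""
    log = []
    lock = threading.Lock()
    barrier = threading.Barrier(n)      # plays the role of the mode-change request

    def worker(p):
        with lock:
            log.append(("compute", p))
        barrier.wait()                  # all processors ask to change access mode
        with lock:
            log.append(("communicate", p))
        barrier.wait()                  # exit of the communication stage
        with lock:
            log.append(("compute-next", p))

    threads = [threading.Thread(target=worker, args=(p,)) for p in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


stages = [stage for stage, _ in run_stages(4)]
print(stages)
```

Whatever the thread scheduling, no worker enters the communication stage before every worker has finished computing, which is the guarantee the barrier provides.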

The time spent in the communication stages can lead to a considerable decrease
of the system efficiency. The procedure proposed for global transfer of
data is simple and general but can, itself, present the following drawbacks: a)
the time spent for the data transfer can be significant relative to the processing
time; b) the task of data transfer can be unbalanced among the multiple processors.
Drawback a) is connected with the granularity of the parallelism and
with the time spent in the control of the architecture. Drawback b) also
influences the time spent in the communication stages: an unbalanced transfer
load can also contribute to the lowering of the processing efficiency.

These problems are addressed in the following sections through the analysis of
typical nonlocal image processing tasks. Efficient parallel algorithms for geometric
image transformations and for the Hough transform are subsequently proposed.

3 Nonlocal Image Processing Parallel Algorithms 

Two parallel algorithms used in nonlocal image processing tasks are proposed: an 
algorithm for image rotation, that requires a nonlocal exchange of information 
as a function of the rotation parameters; and an algorithm for computing the 
Hough transform of an image, which is a transformation commonly applied in 
intermediate level image processing. 

The Image Rotation 

Geometric transformations, namely rotation, are often used in image processing
[5]. For rotating an image by \theta around a centre point (i_0, j_0), pixel values
have to be moved to new locations, which are calculated through the parametric
equations (1). Due to the discrete nature of the image representation, a well
defined one-to-one transformation is applied to each pixel of the
rotated image (i', j') to calculate its location on the original image (i, j).

    i = R^{-1}(i', j', \theta) = \lfloor (i' - i_0) \cos\theta + (j' - j_0) \sin\theta + i_0 \rfloor
    j = R^{-1}(i', j', \theta) = \lfloor (j' - j_0) \cos\theta - (i' - i_0) \sin\theta + j_0 \rfloor    (1)

Pixel (i', j') of the processed image receives the value of pixel (i, j) of the
original image, or a new value resulting from the interpolation of the values of
the pixels that surround (i, j) [5].
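The inverse mapping of equations (1) can be sketched sequentially as follows. The sketch assumes the floor-based form of (1), copies pixel values without interpolation, and fills pixels that map outside the image with a background value; the names `source_pixel` and `rotate` and the background convention are assumptions made for illustration.

```python
import math


def source_pixel(ip, jp, theta, i0, j0):
    """Location (i, j) on the original image for pixel (i', j') of the rotated one."""
    c, s = math.cos(theta), math.sin(theta)
    i = math.floor((ip - i0) * c + (jp - j0) * s + i0)
    j = math.floor((jp - j0) * c - (ip - i0) * s + j0)
    return i, j


def rotate(image, theta, i0, j0, background=0):
    """Rotate a list-of-lists image by theta around (i0, j0), without interpolation."""
    rows, cols = len(image), len(image[0])
    out = [[background] * cols for _ in range(rows)]
    for ip in range(rows):
        for jp in range(cols):
            i, j = source_pixel(ip, jp, theta, i0, j0)
            if 0 <= i < rows and 0 <= j < cols:   # inside the image limits
                out[ip][jp] = image[i][j]
    return out


image = [[r * 4 + c for c in range(4)] for r in range(4)]
for row in rotate(image, math.pi / 6, 2, 2):
    print(row)
```

Iterating over the pixels of the rotated image, rather than over those of the original, guarantees that every output pixel receives exactly one value.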

The method for exchanging information on the OMP architecture, presented 
in the previous section, can be applied to compute in parallel (1). The coordinates 
of the pixels of different sub-images are computed in parallel and their values are 
placed in the proper memory modules. Two main parameters have to be taken in 
order to have a full specification of the parallel algorithm: the number of pixels 
that should be processed in parallel and their locations in the sub-images. 
To define the number of pixels, two opposing conditions have to be considered:
a) for each communication stage the memory access mode has to be
changed twice and all memory modules have to be checked, so processing more pixels
per stage reduces the relative time spent in communication; b) the extra
memory required grows with the number of pixels processed per stage, the
worst situation occurring when all coordinates calculated by a processor (e.g. n)
fall in the same memory module. The premise assumed in this paper is that
n pixels are processed by each processor in each processing stage.

The location of the pixels influences the balancing of the work load of the pro- 
cessors during the communication stages. In order to analyse the requirements 
of data movement, from the perspective of an orthogonal transfer, let us suppose that
the transfers take place through the rows of the memory modules and consider
the two simplified situations: a) the rotation of an image around its central point, 
using an algorithm that processes in parallel n pixels (1 pixel per processor) of 
an image row; b) a situation similar to the described above, except that the 
pixels to be processed belong to a single column of the image. This last situation 
is only used for analysing data transfer, because, in column access mode, the 
pixels can not be processed in parallel. 

For situation (a), pixels processed in parallel have the same i' coordinate and
j' coordinates that differ by multiples of k (see Fig. 1b), while for situation
(b) the inverse relation between coordinates is observed. The pixels processed in
situation (a) give rise to coordinates of the original image (1) in the following
rows of the memory array:

    \lfloor i/k \rfloor = \lfloor ((i' - i_0) \cos\theta + (j' - j_0) \sin\theta)/k + i_0/k + \phi \times \sin\theta \rfloor ;  0 \le \phi < n    (2)

In this case, the coordinates of two pixels go to the same row of memory modules
for \phi \times \sin\theta < 1. So, the number of coordinates that go to a single row of memory
modules (A_a) can be approximately expressed by the equation A_a = 1/|\sin\theta|.

[Figure 3 plot: 1/|sin(\theta)| and 1/|cos(\theta)| versus \theta (rad)]

Fig. 3. Number of pixels to be transferred per row of memory modules (1 pixel
per processor) for image rotation.

In situation (b), pixels go to the memory modules of the following rows: 



    \lfloor i/k \rfloor = \lfloor ((i' - i_0) \cos\theta + (j' - j_0) \sin\theta)/k + i_0/k + \phi \times \cos\theta \rfloor ;  0 \le \phi < n    (3)

The number of coordinates that go to a single row of memory modules (A_b) can
be approximately expressed by the equation A_b = 1/|\cos\theta|.

A diagram of both functions (A_a and A_b) is depicted in fig. 3. The maximum
number of coordinates in fig. 3 should not be infinite, but limited by the number
of pixels processed. However, the diagram of fig. 3 shows that the maximum
values of the functions A_a and A_b are out of phase (by \pi/2). Moreover, the maximum
value of one function corresponds to the minimum value of the other; thus, if
the coordinates of n rows of pixels (situation b) with n pixels each (situation a)
are calculated in each processing stage, then the maximum number of pixels that
have to be transferred per memory row results from the combination of the
numbers found for both situations.
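The approximation A_a = 1/|\sin\theta| can be checked by counting directly. The sketch below assumes situation (a) with the terms of the row index that do not depend on the processor collected in a constant `base`; the function name `max_pixels_per_row` and the sample values are hypothetical.

```python
import math
from collections import Counter


def max_pixels_per_row(theta, n, base=0.0):
    """Largest number of the n pixels whose row index floor(base + phi*sin(theta)) coincides."""
    rows = Counter(math.floor(base + phi * math.sin(theta)) for phi in range(n))
    return max(rows.values())


theta, n = 0.3, 16
print(max_pixels_per_row(theta, n), 1 / abs(math.sin(theta)))
```

For a step of \sin\theta per processor, each memory row receives either the integer part of 1/|\sin\theta| or one more pixel, so the count stays within one unit of the approximation as long as it is below n.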




Fig. 4. Mean values of the maximum number of pixels that have to be transferred 
for a row of memory modules (n pixels per processor) for image rotation. 



The diagram of fig. 4 displays the mean values of the maximum number of
pixels that have to be transferred in a row of the memory modules, for rotating an
image around its central pixel. The diagram, obtained by computer simulation,
shows that no more than 2 x n pixels have to be transferred in any row of
the memory modules. The maximum number of pixels occurs when the rotation
angle is a multiple of \pi/4 (the intersection points of the functions A in the diagram
of fig. 3).



Algorithm 1 Image rotation on the OMP architecture. 



do_all processors (\phi = 0, ..., n - 1)

  < T mode >

  for l = 0 to k^2 - 1 do begin
    R_{\phi,h} := 0 { number of pixels to transfer }
    for m = 0 to n - 1 do begin

1.    { computes (i, j) by Eq. 1 for i' = m x k + \lfloor l/k \rfloor, j' = \phi x k + l mod k }
      if (i and j inside the image limits) then begin

2.      P_{m,\phi} := i; Q := R_{\lfloor i/k \rfloor,\phi}++; S_{\lfloor i/k \rfloor,\phi}[Q].x := i; S_{\lfloor i/k \rfloor,\phi}[Q].y := j

      end

      else P_{m,\phi} := -1

    end

    < h mode >

    for h = 0 to n - 1 do begin
      for m = 0 to R_{\phi,h} - 1 do begin

3.      i := S_{\phi,h}[m].x; j := S_{\phi,h}[m].y; T_{\phi,h}[m] := I_O_{\phi,\lfloor j/k \rfloor}[i mod k][j mod k]

      end

      R_{\phi,h} := 0

    end

    < T mode >

    for m = 0 to n - 1 do begin
      if (P_{m,\phi} >= 0) then begin

4.      W := P_{m,\phi}; Q := R_{m,\phi}++; I_P_{m,\phi}[\lfloor l/k \rfloor][l mod k] := T_{\lfloor W/k \rfloor,\phi}[Q]

      end

    end

  end

end_all



The parallel algorithm for the image rotation on an OMP architecture is formally
specified in Algorithm 1. The construct do_all processors (\phi = 0, ..., n - 1)
... end_all means that the processing inside the block is done in parallel by
the n processors. The construct < ... > indicates a synchronisation point for
changing the memory access mode of the architecture (h for row access and
T for column access). Data structures are named using capital letters. Indexes
are associated with them, to indicate the modules used, whenever the variables
are stored in the shared memory. The original and processed image arrays are
identified by the symbols I_O and I_P, respectively.

Lines 1 and 2 of Algorithm 1 are devoted to the parallel computation
of the coordinates of n pixels and to their placement on the adequate modules
for orthogonal data transfer. Pixels of the original image are moved across the
modules of each row of memory modules in line 3, with the architecture in
the row access mode. In line 4, the values of the pixels are placed on the new
coordinates, with the architecture in the column access mode. The procedure is
repeated k^2 times for processing all the N x N pixels of an image.

The complexity of the algorithm is O(N^2/n) for lines 1, 2 and 4. The maximum
number of pixels which have to be transferred in a row of memory modules (line
3 of the algorithm) is proportional to n (see fig. 4). Therefore, Algorithm 1
is O(N^2/n) and has an asymptotic efficiency of 100%.



The Hough Transform 

The Hough transform, which is often applied for shape analysis due to its
robustness to noise [6], is calculated using information about the binary edges of
the images [7]. For detecting collinear points, the edge point (i, j) is transformed
into a set of (\rho_i, \theta_i) ordered pairs, where \rho is the normal distance from the origin of the
image to the line and \theta is the angle of this normal with the horizontal axis. The
following parametric equation is applied for any (i, j) using a stepwise increment
for \theta_i (\delta\theta):

    \rho_i = j \times \cos\theta_i + i \times \sin\theta_i ,    (4)

with collinear points voting for a common value (\rho_x, \theta_x).
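The voting scheme of equation (4) can be illustrated with a short sequential sketch. It assumes that \theta is sampled in `n_theta` equal steps over [0, \pi) and that \rho is rounded to the nearest integer; these discretisation choices and the name `hough` are assumptions made for illustration.

```python
import math


def hough(edge_points, n_theta):
    """Accumulate votes for (rho, theta index) pairs from edge points (i, j)."""
    acc = {}
    for i, j in edge_points:
        for t in range(n_theta):
            theta = t * math.pi / n_theta
            rho = round(j * math.cos(theta) + i * math.sin(theta))
            acc[(rho, t)] = acc.get((rho, t), 0) + 1
    return acc


# three collinear edge points on image row i = 5
acc = hough([(5, 0), (5, 3), (5, 7)], n_theta=4)
best = max(acc, key=acc.get)
print(best, acc[best])
```

The three points agree only on the cell with \rho = 5 and \theta = \pi/2 (index 2), which therefore collects all three votes.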

The processing load involved in the calculation of the Hough transform de- 
pends on the image features. Therefore, the distribution of the sub-images of 
equal size by the processors is not enough to secure the balancing of the pro- 
cessing load. One way of balancing the work of the processors is by segmenting 
the Hough space and by forcing each processor to compute (4) for all edge pixels 
of the image, but only for a range of \theta_i [8]. With the image distributed by the
memory modules, an algorithm of this type demands the transmission of the 
coordinates of the edge points found by each processor to all other processors. 




Fig. 5. Distribution of the image and the transformed space by the shared memory.

Let us distribute the image and the Hough space by the memory modules
as depicted in Fig. 5. In a first approach, each processor must transmit the
coordinates of the edge points found in a set of pixels of its sub-image to all
other processors, during a communication stage. However, if the pixels of the
different sub-images to be processed in parallel have the same relative positions,
the quantity of information which has to be transmitted can be reduced: only
binary data about the presence of edges among the pixels is required.



Algorithm 2 Hough Transform of images on the OMP architecture. 



do_all processors (\phi = 0, ..., n - 1)
  for l = 0 to k^2 - 1 do begin
    < T mode >

    R := 0

    for m = 0 to n - 1 do begin

1.    if (I_{m,\phi}[\lfloor l/k \rfloor][l mod k] \neq 0) then R := R + 2^m

    end

    for m = 0 to n - 1 do begin

2.    T_{m,\phi} := R
    end

    < h mode >

    for h = 0 to n - 1 do begin

      R := T_{\phi,h}

      for m = 0 to n - 1 do begin

        if ((R \wedge 01) \neq 0) then begin

          for t = \phi x \pi/(n x \delta\theta) to (\phi + 1) x \pi/(n x \delta\theta) - 1 do begin { \rho_max \approx 1.5 x N }

3.          W := ((l mod k + h x k) x cos[t] + (\lfloor l/k \rfloor + m x k) x sin[t]) / 1.5

4.          ...[W mod k]++

          end

        end

        R := R/2

      end

    end

  end

end_all



A parallel algorithm for the Hough transform calculation on an OMP architecture
is formally specified in Algorithm 2. This algorithm guarantees the
balancing of the processing among the multiple processors, but requires an angular
resolution (\delta\theta) such that n <= \pi/\delta\theta.

Processors start a communication stage checking for the presence of edges
among n pixels of a sub-image, located on the n different modules of a column;
information about the location of these pixels is not relevant, since it is common
to all processors. Information about the edges is coded on a single memory word
and broadcast to all memory modules of a column (line 2 of Algorithm 2).
The architecture operation mode is changed to row memory access and
each processor collects and decodes the information transmitted by all other
processors, consulting all the memory modules in a row. Every time an edge is
found each processor computes (4) for a range of values of \theta_i, \pi/(n x \delta\theta) different
values (line 3 of Algorithm 2). The Hough space is distributed in such a
way that the resulting \rho values calculated by a processor are locally stored in
the row of memory modules accessible to it (see Fig. 5). The accumulators in
the Hough space are updated with no further need of data moving (line 4 of
Algorithm 2).
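The coding of edge presence in a single memory word, and its decoding by repeated halving, can be sketched as follows. The function names are hypothetical; bit m of the word is assumed to correspond to the pixel held in module row m.

```python
def encode_edges(flags):
    """Pack one edge flag per module row into a single word (bit m <-> row m)."""
    word = 0
    for m, is_edge in enumerate(flags):
        if is_edge:
            word += 2 ** m
    return word


def decode_edges(word, n):
    """Return the module rows whose flag is set, testing the lowest bit and halving."""
    rows = []
    for m in range(n):
        if word & 1:
            rows.append(m)
        word //= 2
    return rows


word = encode_edges([1, 0, 1, 1])
print(word, decode_edges(word, 4))
```

A single word thus replaces up to n pairs of coordinates, because the relative position of the pixels inside each sub-image is already known to every processor.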

Supposing an image with C edge points, the complexity of line 3 of
Algorithm 2 is O(C x \pi/(n x \delta\theta)). The coding and the transfer of information
about the edges (line 1 of Algorithm 2) take O(N^2/n) more steps. The overall
asymptotic complexity of Algorithm 2 is O(N^2 x \pi/(n x \delta\theta)), since C <= N^2.

The parallel algorithms proposed for image rotation and Hough transform 
on an OMP architecture have a processing efficiency of 100%. However, in prac- 
tice, the time spent in communication limits the performance of the processing 
systems. The operation of an OMP system was simulated by computer, in order 
to predict the performance of a real image processing system. 

4 Performance Analysis 

In this section, the performance of an image processing OMP system is evaluated 
based on results obtained by computer simulation. The simulations have been 
done by adopting an algorithm driven approach and by using programming 
tools developed by the authors [9]. These tools allow the inclusion of detailed 
information about the operation and characteristics of the target OMP systems 
in the simulation process. It is therefore possible to achieve simulation results 
with a great degree of confidence. 

Several parallel image processing systems with different characteristics can 
be designed around an OMP architecture [10, 11]. The systems can use different 
types of components and can adopt specific techniques for accelerating the access 
to the shared memory (e.g. interleaved memory access mechanisms). The
characteristics of the processing discussed in this paper point to the use of processors 
able to perform fast floating point operations and with a very short memory 
access cycle. The control of the image systems should be simple, allowing the 
implementation of synchronisation barriers and the change of the memory access 
mode with small waste of time. 

An image processing system with the following main characteristics was sim- 
ulated: a) the specifications of the individual processors are related to the ones 
of a signal processor [12] — 16 MIPS and 33 MFLOPS approximately, and a cycle 
time of about 60 ns; b) the number of instruction cycles needed for changing the 
system memory access mode is considered to be 10; c) read and write memory 
operations take only 1 instruction cycle; d) simple memory access schemes are 
considered, without memory interleaving. 
The simulations of the OMP system were carried out with the parallel
algorithms proposed for image rotation and the Hough transform. The expected
processing times for the OMP system were collected from the simulations and
are presented in Tables 1 and 2. These simulations were made with 512 × 512
images. For the simulations with the Hough transform, a synthetic image (see
Fig. 7a) and a real image (see Fig. 7b) were used, with the Marr algorithm
applied for edge detection [13]: the Laplacian operator is applied to the image
previously filtered with a Gaussian function (σ = 1.5). The edge detection
algorithm uses local processing, which is easily mapped onto an OMP
architecture [14]: on the borders of the sub-images, pixel values required to
apply the filter are moved by orthogonal transfer. The tables show the mean and
standard deviation of the processing efficiency for different numbers of
processors. The efficiency is defined as the quotient between the time of the
sequential processing and the time of the parallel processing multiplied by the
number of processors.



Table 1. Performance of the OMP system for image rotation. 
Table 1 presents the times spent with Algorithm 1 for image rotation
at various rotation angles (θ). The efficiency is generally not high, with
communication and processing stages taking comparable times. The processing
times approach the video frame rate when the number of processors increases
to 32. The mean value of the efficiency varies with the rotation angle, in
agreement with the analysis presented in a previous section. Other simulations
were made by doubling the number of pixels considered in each processing step
[14]. For the lowest mean value of the processing efficiency (θ = 45°) an
improvement of approximately 5% is achieved, while for the highest mean value
(θ = 180°) an improvement of approximately 7% is achieved.

Rotation angles that initially lead to greater processing times give rise to 
images with regions of pixels that do not have a correspondence with those of 
the original image. This problem results in a non-uniform distribution of the 
processing load and reinforces the weight of communication times in the total 
time spent. 

Table 2 presents the times spent in the edge detection and the Hough Trans- 
form tasks for two different types of images: a synthesised binary image (Fig. 7a) 
Fig. 6. Processing efficiency for image rotation. 




Fig. 7. 512 × 512 binary images used for the Hough transform: a) a synthesised 
image (binary); b) the edges detected for a real image (Marr algorithm). 



Table 2. Performance of the OMP system for the image segmentation task using 
Hough transform. 

Fig. 8. Processing efficiency for image segmentation (edge detection plus Hough 
transform): a) for the real image; b) for the synthetic image. 



and a real image with 256 grey levels (Fig. 7b shows the edges detected in the 
real image with the Marr algorithm for σ = 1.5). For each image, Algorithm 2 
was applied for two different values of δθ: δθ = π/512 and δθ = π/64. 

For the real image, the mean value of the efficiency is relatively high for a
low value of δθ, because each processor has to calculate (4) a great number of
times in each processing stage. Table 2 and Fig. 8 show the decline of the
efficiency as δθ increases. The image of Fig. 7a was synthesised in order
to allow the observation of processing efficiency in a very adverse situation.
The number of supposed edges is quite small, which implies a small processing
load when compared to the communication requirements. The mean value of
the processing efficiency decreases to approximately half the value obtained for
the real image. The decrease of the processing efficiency with the number of
processors is also greater than that observed for the real image.

5 Conclusions 

A method for mapping nonlocal image processing tasks onto OMP systems, using
image parallelism, is presented in this paper. No a priori knowledge about
information exchange requirements is needed. The processing is modelled by a
sequence of processing and communication stages. During the communication
stages, the information is orthogonally transferred or broadcast in parallel by
the multiple processors. The processing is synchronised at the beginning and at
the end of the stages by implementing synchronisation barriers.

The effectiveness of the method is shown by the development and analysis of 
parallel algorithms for typical nonlocal image processing, namely for the image 
rotation and the Hough transform tasks. The analysis of the complexity of the 
algorithms demonstrates that the processing efficiency achieved is 100%. 
The performance of an OMP image processing system is evaluated based on
simulation results. The times for the proposed nonlocal image processing parallel
algorithms show that the system can perform well if the following
aspects are considered: i) a fair division of the processing in a stage is almost
achieved by distributing equally sized sub-images among the processors, which
also means among the memory modules; ii) the number of pixels processed in
each step must be large enough to reduce the weight of the communication time
in the total processing time; iii) the responsibility of transferring the information
in a step should be divided uniformly among the multiple processors.
Abstract. The reconstruction of tracks left by particles in a scintillating
fiber detector from a high energy experiment is discussed. The track
reconstruction algorithm is based on the Hough transform and achieves
an efficiency above 86%. The algorithm is implemented on a 16-node
parallel machine using two parallelism approaches in order to speed up
the application of the Hough transform, which is known to have a large
computational cost.



1 Introduction 

In modern high-energy particle collider experiments, the trackers play an impor- 
tant role. Their task is to reconstruct the tracks left in the detectors by reactions 
resulting from particle beam collisions. This is a very important task, as it allows 
the computation of the momentum of particles. 

At CERN ([1]), the European Laboratory for Particle Physics, LEP (Large
Electron-Positron Collider) collides electrons and positrons at four detection
points, and the resulting reactions are observed by a set of sub-detectors that
are placed around these collision points. In our case, we focus on the SFT
(Scintillating Fiber Tracker), a tracker that has been developed to operate at
L3, one of the four detectors present at LEP.

The structure of the SFT can be seen in Fig. 1. The SFT has two shells, placed
0.187 m and 0.314 m away from the collision axis. These shells are composed of
very thin scintillating fibers, 60 μm in diameter, arranged in groups of 1000
fibers. Each shell has four sublayers of 2 mm: two of them (φ1 and φ2) provide
the coordinates in the rφ plane, and the other two (u and v) furnish, through
stereo vision, the coordinates in the rz plane. One may note that the tracks are
only observed in two such small regions, which have a large gap between them.

As the resulting sub-particles reach the tracker shells, fibers get excited,
producing light and transmitting it towards the light detectors, which convert
light into an electrical signal. The light detectors are image tubes made of silicon
chips, which have to read out the position at which the tracker was reached by
a particle. Fig. 2 illustrates this process. These pixel chips function similarly to
CCDs (Charge-Coupled Devices).



J.M.L.M. Palma et al. (Eds.): VECPAR2000, LNCS 1981, pp. 467-480, 2001. 
[Fig. 2 labels: active surface = 32 mm; 16 × 64 = 1024 pixels; 64 columns × 63 μm = 4 mm]



Fig. 2. Readout of interaction positions by pixel chips. 



Data used in this work were generated by Monte Carlo simulation ([2]).
The software package PYTHIA 5.6/JETSET 7.3 ([3]) was used for generating
electron-positron collisions at LEP conditions, and the simulation of the SFT
structure was made by using the software GEANT 3.159 ([4]). The generated
events correspond to 2D images¹ formed by pixels that are produced on the
tracker due to particle interactions with the detector. A typical event to be
reconstructed is shown in Fig. 3. Tracks consist of helices, totally described by
their two parameters: curvature κ and angle θ. Low energy tracks are removed
from the original image², as they do not represent tracks of interest for the
envisaged physics.

In this paper we propose to use the Hough transform ([5]) to reconstruct
the tracks of the SFT. The Hough transform is known to be a powerful technique
for track reconstruction, although its application is limited by its highly
intensive computational requirements. Therefore, to speed up the reconstruction
procedure, we exploit the parallelism in the Hough transform and
implement the track reconstruction algorithm on a 16-node parallel machine.

In the following two sections, the fundamentals of the Hough Transform and 
the main features of the parallel machine (TN-310 system) are presented. Section 
4 details the parallelization techniques developed and the achieved results in 
terms of speed-up and track reconstruction efficiency. Finally, some conclusions 
are derived in Section 5. 



¹ The third coordinate (z) is considered to be zero for the tracks we have.

² This filtering process is a kind of preprocessing and is not a task of the reconstruction
algorithm.
Fig. 3. A typical event: pixels (top) and the corresponding target tracks
(bottom).



2 The Hough Transform (HT) 

The Hough Transform was introduced by Paul V. C. Hough in 1962 ([5]). The 
initial idea was to use it in the detection of complex patterns in binary images. 
The method has encountered significant resistance to its adoption due to its 
enormous computational cost. As substantial improvements with respect to its 
implementation have been made in recent years, the Hough transform now finds 
applications in many image reconstruction problems. 

2.1 The Standard Hough Transform (SHT) 

The HT’s main goal is to determine the values of the parameters that completely 
define the original image shape. In order to achieve this goal, the input space is 
mapped onto the parameter space and by histogramming the resulting mapped 
points the parameter values are determined. In other words, a global detection 
problem (in the input space) is converted into a local one (in the parameter 
space). 

As an example, let us consider the problem of detecting straight lines in a
noisy environment. Using the slope m and the offset c as the parameters for the
straight lines, the model is obtained from these two parameters (where the hat
indicates the estimate of a parameter):



y = mx + c (1) 

From this equation, a relation f can be derived as:

f((m, c), (x, y)) = y - mx - c = 0 (2)

This relation maps each possible combination of parameter values (m, c) onto
a set (x, y) of the input space, that is, the parameter space is mapped onto
the original input data space. From this, an inverse relation, say g, can be
defined, so that it maps the input space onto the parameter space. This is the
so-called back-projection:



g((x, y), (m, c)) = y - xm - c = 0 (3)

For the straight line problem, relation g results in: 

c = -xm + y (4)

After having back-projected the input space onto the parameter space, we
search for the regions in the parameter space for which the “density of lines” is
high, that is, we search for values of (m, c) that have a high probability of
representing an actual straight line from the input space. This search is performed
by building an accumulator over the parameter space, which performs data
histogramming. As an example, consider an input space as shown in Fig. 4, where
four straight lines have to be detected from their pixels, in spite of the noisy
pixels. Performing the corresponding back-projection, the histogram in the
parameter space is obtained as shown, where the values of m and c corresponding
to each straight line can be estimated from the four clear peaks observed in the
parameter space.
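
The accumulation described above can be sketched in a few lines of Python. This is only an illustration and not the authors' implementation: the function names, the list of candidate slopes and the uniform binning of c are assumptions made for the example.

```python
from collections import Counter


def standard_hough_lines(pixels, slopes, c_min, c_max, n_c_bins):
    """Back-project every pixel onto the (m, c) space and histogram the votes."""
    bin_width = (c_max - c_min) / n_c_bins
    accumulator = Counter()
    for x, y in pixels:
        for i, m in enumerate(slopes):
            c = -x * m + y  # relation (4): one c for each candidate slope m
            j = int((c - c_min) // bin_width)
            if 0 <= j < n_c_bins:
                accumulator[(i, j)] += 1
    return accumulator


def strongest_peak(accumulator, slopes, c_min, c_max, n_c_bins):
    """Return (m, c at bin centre, votes) for the most populated cell."""
    bin_width = (c_max - c_min) / n_c_bins
    (i, j), votes = accumulator.most_common(1)[0]
    return slopes[i], c_min + (j + 0.5) * bin_width, votes


# Hypothetical input: ten pixels on y = 2x + 1 plus three noise pixels.
line_pixels = [(x, 2 * x + 1) for x in range(10)]
noise_pixels = [(1, 8), (7, 2), (4, -3)]
slopes = [k * 0.5 for k in range(-6, 7)]  # candidate slopes -3.0 ... 3.0
acc = standard_hough_lines(line_pixels + noise_pixels, slopes, -10.0, 10.0, 40)
m_est, c_est, votes = strongest_peak(acc, slopes, -10.0, 10.0, 40)
```

Each pixel costs one evaluation of (4) per candidate slope, and the estimate of c is only as precise as the bin width, which is why the granularity of the accumulator drives both accuracy and cost.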

The accuracy in the estimation of the curve parameters depends on the
granularity of the accumulator. A finer granularity (a higher number of channels
in the histogram) produces better accuracy, but also increases the number of
accumulator cells to be filled and searched. Thus, the computational
cost for the SHT is usually very high.

The SHT can be easily extended to arbitrary shapes. Relations f and g are
simply generalized to:



f((a_1, a_2, ..., a_n), (x, y)) = 0 (5)

g((x, y), (a_1, a_2, ..., a_n)) = 0 (6)

For the track reconstruction problem, we search for peaks in the κθ parameter
space in order to detect helices in the input space.
Fig. 4. Four straight lines to be detected in a noisy image (top) and the
corresponding histogram in the parameter space (bottom).



2.2 The Local Hough Transform (LHT) 

The Local Hough Transform [6] aims at reducing the number of operations
required to accumulate the HT. Instead of back-projecting each point from the
input space onto an entire curve in the parameter space, the LHT maps each
pair {(x_1, y_1), (x_2, y_2)} of pixels from the input space onto its unique
corresponding point (a_1, a_2) in the parameter space. For the straight line case,
this requires the solution of the pair of equations below:



y_1 = m x_1 + c (7)

y_2 = m x_2 + c (8)
Of course, if there are n parameters to be determined, a set of n equations
must be solved. The LHT drastically reduces the number of calculations
necessary for histogramming the parameter space, as it does not require a scanning
process. Although it has restrictions for some problems, the LHT is well suited
for the track reconstruction problem because data are concentrated in two small
(local) regions of the input space. Indeed, we do not form every combination
of pixels: pairs are only formed from pixels belonging to different shells and
whose φ (polar) coordinates do not differ too much³, as very curly tracks are
not expected from the physics of interest in the experiment.
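
As a hedged illustration of the pairing rule, the sketch below applies the LHT to the straight-line case of equations (7) and (8), not to the helix parameters used for the actual tracks. The function names, the rounding of (m, c) into bins and the sample coordinates are assumptions; only the 20-degree cut is taken from the text.

```python
import math
from collections import Counter

MAX_DELTA_PHI_DEG = 20.0  # pairing cut quoted in the footnote


def polar_angle_deg(x, y):
    """Polar coordinate phi of a pixel, in degrees."""
    return math.degrees(math.atan2(y, x))


def local_hough_lines(inner_shell, outer_shell, m_bin, c_bin):
    """Vote once per accepted pixel pair (one pixel from each shell)."""
    accumulator = Counter()
    for x1, y1 in inner_shell:
        for x2, y2 in outer_shell:
            dphi = abs(polar_angle_deg(x1, y1) - polar_angle_deg(x2, y2))
            dphi = min(dphi, 360.0 - dphi)  # wrap around the circle
            if dphi > MAX_DELTA_PHI_DEG or x1 == x2:
                continue  # pair rejected (or slope undefined)
            m = (y2 - y1) / (x2 - x1)  # solve (7) and (8) for m ...
            c = y1 - m * x1            # ... and for c
            accumulator[(round(m / m_bin), round(c / c_bin))] += 1
    return accumulator


# Hypothetical pixels: the second outer pixel is 90 degrees away and is skipped.
inner = [(1.0, 1.0)]
outer = [(2.0, 2.0), (-2.0, 2.0)]
lht_acc = local_hough_lines(inner, outer, m_bin=0.1, c_bin=0.1)
```

Each accepted pair contributes a single vote, so no scan over one of the parameters is needed, in contrast to the back-projection of the SHT.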



3 The TN-310 System 

The TN-310 system ([7]) is a MIMD parallel computer with distributed memory.
It houses 16 nodes that comply with the HTRAM (High performance TRAnsputer
Modules) standard, equally split into two cards. Each node can communicate
with any other node by sending messages through a fast interconnection
network based on STC104 chips ([8]).

Each HTRAM node contains a transputer (InMOS T9000), 8 MB of RAM 
memory, a DSP (ADSP-21020) and a buffer of 256 KB for communication be- 
tween the transputer and the DSP. The transputers are very good processors for 
communication tasks, due to their VCPs (Virtual Channel Processors), and the 
DSPs are optimized for signal processing applications. Fig. 5 shows the general 
architecture of the machine, including the interconnection network. 

Note that the number of switches involved in the communication path varies 
for different HTRAM nodes. Therefore, to achieve faster speed, an optimum 
placement of processes into nodes must be realized and this is explored in this 
work. A PC running Windows hosts the system. 

Programming for the TN-310 system involves both describing the system 
configuration and coding the tasks to be run on each processing node. In terms 
of system configuration, the number of processors to be used must be known to 
the system, and the way they will communicate must also be clearly established 
and configured. These features are described in an extra configuration file. 

The TN-310 system provides three layers of programming: PVM (Parallel 
Virtual Machine), RuBIS (micro-kernel), and C-Toolset, which was chosen for 
this work due to its faster execution time. This environment allows processes to 
be coded in ANSI C language. Some libraries were developed to include commu- 
nication functions and procedures. 

Fig. 6 illustrates the general process of generating a single executable file
within the C-Toolset environment from the configuration and process codings. A
makefile has to be built in order to handle compilation and linking correctly. The
internal structure of the machine is kept in a file written in a specific language
(NDL, Network Description Language), and cannot be changed by users.

³ If the difference between the polar coordinates of two points is greater than 20
degrees, we do not form a pair.

4 Track Reconstruction 

The tracking reconstruction algorithm based on the Hough transform was imple- 
mented on the TN-310 system. Two different parallel approaches were developed 
and compared with a sequential implementation in the same environment. 

4.1 The Sequential Algorithm 

For the sequential implementation of the Hough transform, three main parts can
be distinguished: (1) accumulation of the HT, (2) elimination of peaks in the
histograms that may not represent real tracks and (3) grouping of neighboring
peaks that may represent the same track⁴. After these phases, we can verify the
efficiency of the algorithm. Steps 2 and 3 can be viewed as a filtering process
in the parameter space, and these steps are quite important to avoid detecting
ghost tracks.

Some parameters used in steps 2 and 3 must be optimized in order to achieve
the best efficiency. This was done during a training phase. Three parameters
are the most important ones: two for determining neighboring regions and
the third for determining if a cell has a value high enough to represent a real
track. The algorithm was trained on the 79 events available (a total of 1321
helices) by selecting the set of parameters that resulted in the highest
efficiency in reconstructing the tracks. This efficiency is
computed by averaging three figures of merit often used in track reconstruction:
precision (the ratio between the number of real tracks reconstructed and the
total number of reconstructed tracks), recall (the ratio between the number of
real tracks reconstructed and the actual number of real tracks in the input
space) and goodness (the ratio between the number of real tracks reconstructed
minus the number of ghost tracks detected and the actual
number of real tracks present in the input space) ([5]). The optimized algorithm
achieved a correct recognition of 86.5% (average efficiency) of the tracks, and
99.5% of these tracks were correctly reconstructed. This corresponds to a
resolution better than 10% in the momentum reconstruction of particles, according
to an algorithm suggested in [2]. Fig. 7 shows the reconstruction for the event
shown in Fig. 3.
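
The three figures of merit and their average can be written out as a short Python sketch. The function name and the sample counts are hypothetical, and the sketch assumes that every reconstructed track is either a real track or a ghost, so that the total number of reconstructed tracks is their sum.

```python
def reconstruction_efficiency(real_reconstructed, ghosts, actual_tracks):
    """Return (precision, recall, goodness, average efficiency)."""
    # Assumption: reconstructed tracks = real tracks found + ghost tracks.
    total_reconstructed = real_reconstructed + ghosts
    if total_reconstructed <= 0 or actual_tracks <= 0:
        raise ValueError("track counts must be positive")
    precision = real_reconstructed / total_reconstructed
    recall = real_reconstructed / actual_tracks
    goodness = (real_reconstructed - ghosts) / actual_tracks
    efficiency = (precision + recall + goodness) / 3.0
    return precision, recall, goodness, efficiency


# Hypothetical event: 20 real tracks, 18 of them found, 2 ghosts reported.
precision, recall, goodness, efficiency = reconstruction_efficiency(18, 2, 20)
```

For these made-up counts the precision and recall are both 0.9 and the goodness is 0.8, so ghost tracks are penalised twice: once in the precision and once in the goodness.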

4.2 Using Data Parallelism 

Having developed the sequential algorithm, the next task was to parallelize it. 
The first approach was to use a master/slave architecture to implement data 
parallelism (see Fig. 8). A master process continuously receives data from the 
host machine and distributes them sequentially to free slaves that perform the 
reconstruction algorithm. The minimum number of slaves required by the
application depends on the ratio between computing time and communication time:
once the slave that first received data to process becomes free again, it is
useless to add further processing nodes to the chain.
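
The text gives no formula for this number, so the following Python sketch is only one possible model, stated as an assumption: the master needs a fixed time to send one event, each slave needs a fixed time to process it, and the master serves the slaves one after another. The function name and the timings are hypothetical.

```python
import math


def slaves_to_saturate(t_compute, t_communicate):
    """Number of slaves after which the first slave is free again.

    Assumed model: the master spends t_communicate per event sent and a slave
    spends t_compute per event, so the first slave is idle again after
    t_communicate + t_compute, during which the master can serve
    (t_communicate + t_compute) / t_communicate slaves in total.
    """
    if t_compute <= 0 or t_communicate <= 0:
        raise ValueError("times must be positive")
    return math.ceil((t_compute + t_communicate) / t_communicate)


# Hypothetical timings in seconds: a slower algorithm keeps more slaves busy.
slow_algorithm = slaves_to_saturate(7.0, 0.5)
fast_algorithm = slaves_to_saturate(1.0, 0.5)
```

Under this model a faster reconstruction algorithm saturates with fewer slaves, because its computing-to-communication ratio is lower.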

⁴ When the accumulator reaches higher granularity, an actual peak may artificially be
split into two, so that grouping neighboring cells may be considered.
Fig. 7. Reconstructed event (compare with Fig. 3). 




Fig. 8. Master/Slave architecture. 



For this parallelization scheme, we obtained a speed-up (gain) of 14.7 for the
SHT and 11.25 for the LHT. As the optimum numbers of slaves were respectively
15 and 12, the parallel processing efficiencies were equal to 98% and 93.8%. Note
that as the LHT algorithm is faster, its computation/communication time ratio
is lower, fewer slaves are necessary to optimize parallelization, and so we get a
lower speed-up. However, the absolute time of operation for the parallel LHT
remains lower than that for the parallel SHT⁵.



⁵ A time of 2.02 seconds is necessary to process an event in the parallel LHT. This is 
an average value, because events have different numbers of tracks. 
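
The efficiencies quoted for data parallelism follow from dividing each speed-up by the corresponding number of slaves. A minimal Python check, with a hypothetical function name and the figures taken from the text:

```python
def parallel_efficiency(speed_up, n_slaves):
    """Parallel efficiency as speed-up per slave."""
    if n_slaves <= 0:
        raise ValueError("n_slaves must be positive")
    return speed_up / n_slaves


sht_efficiency = parallel_efficiency(14.7, 15)   # about 0.98
lht_efficiency = parallel_efficiency(11.25, 12)  # 0.9375, quoted as 93.8%
```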
4.3 Using Instruction Parallelism 

The second approach to parallelizing the algorithm was to implement a form
of instruction parallelism. In this case, slaves operate over the same data
(distributed by the master) but perform different parts of the whole
reconstruction algorithm. Due to interdependencies, this approach tends to be more
communication intensive than the previous one based on data parallelism.

The division of tasks among slaves was made as follows: each slave was
responsible for the execution of the whole Hough transform (either global or local)
algorithm for the same data over only one region of the parameter space, according
to Fig. 9. At step 2 of the reconstruction algorithm (elimination), each cell of
the accumulator must know the value stored in all the cells lying in a
neighborhood region (for the frontier cells, these regions are represented by the
dashed lines in the figure). Therefore, before starting step 2, slaves must
communicate in order to proceed with their tasks.
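
The need for this exchange can be illustrated with a small Python sketch that computes, along one axis of the accumulator, the cells a slave owns and the larger range it must know before the elimination step. The equal-width split, the function names and the numbers are assumptions for illustration; the actual layout is the one of Fig. 9.

```python
def owned_bounds(index, n_regions, n_cells):
    """Half-open range [start, stop) of cells owned by region `index`."""
    size = n_cells // n_regions
    start = index * size
    # The last region absorbs any remainder of the division.
    stop = n_cells if index == n_regions - 1 else start + size
    return start, stop


def bounds_with_halo(index, n_regions, n_cells, halo):
    """Owned range widened by the neighborhood width, clipped to the axis."""
    start, stop = owned_bounds(index, n_regions, n_cells)
    return max(0, start - halo), min(n_cells, stop + halo)


# Hypothetical axis of 90 accumulator cells split among 3 slaves,
# with a neighborhood of 2 cells on each side.
middle_owned = owned_bounds(1, 3, 90)
middle_needed = bounds_with_halo(1, 3, 90, halo=2)
first_needed = bounds_with_halo(0, 3, 90, halo=2)
```

The cells in the needed range that lie outside the owned range are exactly the ones that must be received from the neighboring slaves.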
Fig. 9. The parameter space division and the neighborhood regions. 



The way slaves communicate is illustrated in Fig. 10. In order to reduce the 
communication overheads, only neighbor nodes communicate. After performing 
the elimination task and before the grouping phase, slaves must communicate 
again, sending the corrections their neighbors need. These corrections are a direct 
consequence of the scanning mechanism used to look for peaks in the histogram 
(from top to bottom, i.e., a line scanning mechanism). 

The main difference of this communication scheme from the previous one is 
that now the slaves send data only to the neighbors that are at their right or 
below them, due to the scanning direction (see Fig. 11). 
Fig. 12 illustrates the whole procedure for implementing this instruction
parallelism.




[Fig. 12 diagram labels: phases of accumulating the HT, eliminating, grouping and verifying]



Fig. 12. Data/Instruction flow in time. 



For this application, a speed-up equal to 11.8 and an efficiency of 78.8% (for
15 slaves) were achieved for the SHT. This scheme was not suitable for
implementing the LHT, because the speed-up remained very low in comparison to the
one obtained with data parallelism: in the LHT, computation is faster and few
communications are needed, whereas this second approach establishes a lot of
communication.

5 Conclusions 

A track reconstruction algorithm for a scintillating fiber tracker in experimental 
high-energy physics was developed using Hough transforms. The algorithm was 
successfully implemented on a 16-node parallel machine. Two methods for
partitioning the sequential implementation were developed, using data and
instruction parallelism techniques.

The reconstruction algorithm was able to identify correctly 86.5% of the 
tracks and allowed the computation of the momentum with a resolution better 
than 10% for 99.5% of the identified tracks. 
The parallel approach was shown to be able to run this complex reconstruction
algorithm in about 2 seconds per event and, using data parallelism, a 98%
parallel efficiency was achieved. The algorithm is now being ported to
a similar MIMD environment based on DSPs, in order to achieve a further
improvement in processing speed.
Chapter 5: 

Finite/Discrete Elements in Engineering 
Applications 



Introduction 



David R. Owen, in his invited talk, Finite/Discrete Element Analysis of
Multi-fracture and Multi-contact Phenomena, presents a variety of new problems where
computer simulation was unthinkable until recently, because of both their
complexity and the computer resources required.

The paper by Martins et al. proposes a parallel edge-based finite-element 
technique that can lead to a more efficient usage of computer memory compared 
with traditional approaches. 

In the second selected paper, Cagniot et al. deal with parallelization, using 
High Performance Fortran, of a three-dimensional finite-element based code for 
electromagnetic simulation; the results, obtained on two architectures with
up to 16 processors, show a degradation of parallel efficiency
as the number of processors increases.

Granular flows, as in Owen’s talk, and non-rigid solids, are the type of simu- 
lation most appropriate to the discrete approach as discussed in the contribution 
by Romero et al., which shows the simulation of a table cloth under windy con- 
ditions. The parallel conjugate gradient method is a key feature of the algorithm 
for fast cloth simulation, implemented in a shared memory architecture. 

The chapter concludes with the work of Gomes et al., on the parallelization 
of a simulation code of liquid-liquid agitated columns. 
Abstract. A dynamic domain decomposition strategy is proposed for
the effective parallel implementation of combined finite/discrete element
approaches for problems involving multi-fracture and multi-contact
phenomena. Attention is focused on the parallelised interaction detection
between discrete objects. Two graph representation models are proposed
and a load imbalance detection and re-balancing scheme is also suggested.
Finally, numerical examples are provided to illustrate the parallel
performance achieved with the current implementation.



1 Introduction 

The last several decades have witnessed the great success of the finite element 
method, along with a tremendous increase of computer power, as a numerical 
simulation approach for applications across almost every engineering discipline. 
However, for situations involving discrete or discontinuous phenomena, a com- 
bined finite/discrete element method naturally offers a more powerful solution 
capability. Typical examples that can considerably benefit from this combined 
solution strategy include process simulation (e.g. shot peening, granular flow, 
and particle dynamics) and fracture damage modelling (e.g. rock blasting, min- 
ing applications, and projectile impact). 

Besides their discrete/discontinuous nature, these applications are often
characterised by the following additional features: they are highly dynamic, with
rapidly changing domain configurations; sufficient resolution is required; and
multi-physics phenomena are involved. In the numerical solution context, contact
detection and interaction computations often take more than half of the entire
simulation time, and the small time step imposed in the explicit integration
procedure also gives rise to the requirement of a very large number (e.g. millions) of
time increments to be performed. For problems exhibiting multi-fracturing
phenomena, the necessity of frequent introduction of new physical cracks and/or
adaptive re-meshing at both local and global levels adds another dimension of
complexity. All these factors make the simulation of a realistic application
extremely computationally intensive.
Consequently, parallelisation becomes an obvious option for significantly
increasing existing computational capabilities, along with the recent remarkable
advances in hardware performance. In recent years considerable effort has been
devoted to the effective parallel implementation of conventional finite element
procedures, mainly based on a static domain decomposition concept. The features
itemised above associated with the applications of interest make such a
parallelisation much more difficult and challenging. Only very recently have some
successful attempts emerged at tackling problems of a similar nature [1,2,3].

It is evident that a dynamic domain decomposition (DDD) strategy plays 
an essential role in the success of any effective parallel implementation for the 
current problem. The ultimate goal of this work is therefore to discuss the ma- 
jor algorithmic aspects of dynamic domain decomposition that make significant 
contributions to enhancing the parallel performance. Our implementation is also 
intended to be general for both shared and distributed memory parallel plat- 
forms. 

The outline of the current work is as follows: In the next section, a general 
solution procedure of the combined finite/discrete element approach for prob- 
lems involving material discontinuity and failure is reviewed and a more detailed 
description is devoted to the multi-fracture modelling and the interaction detec- 
tion; Section 3 presents dynamic domain decomposition parallel strategies for 
the combined finite/discrete element approach and parallel implementation for 
both the finite element computation and the global search is also investigated. 
Issues relevant to the dynamic domain decomposition algorithm for the contact
interaction computation, mainly the two graph representation models for
discrete objects, are discussed in Section 4. Section 5 is devoted to the description of
a load imbalance detection and re-balancing scheme. Finally, numerical exam- 
ples are provided to illustrate the parallel performance achieved with the current 
implementation. 



2 Solution Procedures for Combined Finite/Discrete 
Element Approaches 

Engineering applications involving material separation and progressive failure 
can be found, amongst others, in masonry or concrete structural failure, demo- 
lition, rock blasting in open and underground mining and fracture of ceramic 
or glass-like materials under high velocity impact. The numerical simulation of 
such applications, especially in large scale, has proved to be very challenging. 
The problems are usually represented by a small number of discrete bodies prior 
to the deformation process. In the combined finite/discrete element context, the 
deformation of each individual body is modelled by the finite element discreti- 
sation and the inter-body interaction is simulated by the contact conditions. 
During the simulation process, the bodies are damaged, by, for example, tensile 
failure, and modelling of the resultant fragmentation may result in possibly two 
or three orders of magnitude more bodies by the end of the simulation. In addi- 
tion, the substantial deformation of the bodies may necessitate frequent adaptive 
remeshing of the finite element discretisation. Therefore the configuration and 
number of elements of the finite element mesh and the number of bodies are 
continuously changing throughout the simulation. 

Similarly, process engineering applications often contain a large number of 
discrete bodies. In many cases, these bodies can be treated as rigid and repre- 
sented by discrete elements as simple geometric entities such as disks, spheres 
or ellipses. Discrete elements are based on the concept that individual material 
elements are considered to be separate and are (possibly) connected only along 
their boundaries by appropriate physically based interaction laws. The response 
of discrete elements depends on the interaction forces which can be short-ranged, 
such as mechanical contact, and/or medium-ranged, such as attraction forces in 
liquid bridges. 

Contact cases considered in the present work include node to edge/facet and 
edge/facet to edge/facet. Short-range mechanical contact also includes disk/sphere 
to disk/sphere and disk/sphere to edge/facet. Medium-range interactions can be 
represented by appropriate attractive relations between disk/sphere and disk/ 
sphere entities. 

The governing dynamic equations of the system are solved by explicit time 
integration schemes, notably the central difference algorithm. 
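
As an illustration of the explicit scheme, the following Python sketch advances a system by one central difference step. It is a minimal sketch, assuming lumped (diagonal) masses and velocities stored at half steps; the function name `central_difference_step` and its arguments are hypothetical and are not taken from the implementation described here.

```python
def central_difference_step(x, v_half, force, mass, dt):
    """Advance one explicit central-difference step.

    x      : positions at time t_n (one entry per degree of freedom)
    v_half : velocities at the half step t_{n-1/2}
    force  : callable returning the total forces for the positions x
    mass   : lumped masses, one per degree of freedom (assumption)
    dt     : time step length
    """
    f = force(x)
    # Accelerations follow directly from the lumped masses (no equation solve).
    a = [fi / mi for fi, mi in zip(f, mass)]
    # Velocities are advanced from t_{n-1/2} to t_{n+1/2}.
    v_new = [vi + ai * dt for vi, ai in zip(v_half, a)]
    # Positions are advanced from t_n to t_{n+1} using the new half-step velocities.
    x_new = [xi + vi * dt for xi, vi in zip(x, v_new)]
    return x_new, v_new
```

In a combined finite/discrete element code, `force` would collect the internal forces, the interaction forces and the external loads before this update is applied.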

2.1 Procedure Description 

In the context of the explicit integration scheme, a combined finite and discrete 
element approach typically performs the following computations at each time 
step: 

1. Finite element and fracture handling: 

— Computation of internal forces of the mesh; 

— Evaluation of material failure criterion; 

— Creation of new cracks if any; 

— Global adaptive re-meshing if necessary; 

2. Contact/interaction detection: 

— Spatial search: detection of potential contact/interaction pairs among 
discrete objects; 

— Interaction resolution: determination of actual interaction pairs through 
local resolution of the kinematic relationship between (potential) inter- 
action pairs; 

— Interaction forces: computation of interaction forces between actual in- 
teraction pairs by using appropriate interaction laws. 

3. Global solution: computation of velocities and displacements for all nodes; 

4. Configuration update: update of coordinates of all finite element nodes and 
positions of all discrete objects. 

The procedures of finite element internal force computation, equation solu- 
tion and configuration update in the above approach are all standard operations, 
and therefore further description is not necessary. However, the fracture mod- 
elling and the interaction detection warrant further discussion. 

2.2 Multi-fracturing Modeling 

Two key issues need to be addressed for successful modelling of material failure: 

(i) the development of constitutive models which reflect the failure mechanism; 

(ii) the ability of numerical approaches to handle the discontinuities such as shear 
bands and cracks generated during the material failure and fracture process. 



Failure Models A variety of constitutive models for material failure have ap- 
peared over the years, with softening plasticity and damage theory being the two 
most commonly adopted in the nonlinear finite element analysis of failure. For 
brittle materials, a simple Rankine failure model can be employed. After initial 
yield, a rotating crack formulation may be employed in which the anisotropic 
damage is modelled by degrading the elastic modulus in the direction of the 
current principal stress invariant. 

For all fracture or localisation models regularisation techniques must be intro- 
duced to render the mesh discretisation objective. Optional formulations include 
non-local damage models, Cosserat continuum approaches, gradient constitutive 
models, viscous regularisation and fracture energy releasing/strain softening 
approaches. All models effectively result in the introduction of a length scale and 
have specific advantages depending on the model of fracture and loading rate. 

A more detailed description of various fracture models can be found in [4]. 



Topological Update Scheme The critical issue of fracture modelling is how 
to convert the continuous finite element mesh to one with discontinuous cracks 
and to deal with the subsequent interactions between the crack surfaces. The 
most general approaches permit fracture in an arbitrary direction within an 
element and rely on local adaptive re-meshing to generate a well-shaped element 
distribution. 

A particular fracture algorithm is developed in this work to model the failure 
of brittle materials subject to impact and ballistic loading. The fracture algo- 
rithm inserts physical fractures or cracks into a finite element mesh such that 
the initial continuum is gradually degraded into discrete bodies. However, the 
primary motivation for utilising the algorithm is to correctly model post-failure 
interaction of fractures and the motion of the smaller particles created during 
the failure process. 

Within this algorithm, a nodal fracture scheme is employed to transfer the 
virtual smeared crack into a physical crack in a finite element mesh. The scheme 
is a three-stage procedure: 

(i) Creation of a failure map for the whole domain; 

(ii) Assessment of the failure map to identify where fractures should be inserted; 

(iii) Updating of the mesh, topology and associated data. 

In 2-D cases, the failure direction is defined to coincide with the maximum 
failure strain direction and the crack will propagate orthogonal to the failure 
direction. Associated with the failure direction, a failure plane is defined with 
the failure direction as its normal and the failed nodal point lying on the plane. 
A crack is then inserted through the failure plane. If a crack is inserted exactly 
through the failure plane, some ill-shaped elements may be generated. Local 
re-meshing is then needed to eliminate them. Alternatively, the crack is allowed 
to propagate through the most closely aligned element boundary. In this way, 
no new elements are created and the updating procedure is simplified. However, 
this procedure necessitates a very fine mesh discretisation around the potential 
fracture area. Within the current algorithm a minimum element area criterion 
is used to ensure that excessively small sliver elements are not created. If the 
element area obtained by splitting the elements is below this threshold value then 
the fracture is forced to be along element boundaries. For 3-D situations, the 
corresponding fracture algorithm is basically the same but the implementation 
procedure is more complicated. 

In addition, an element erosion procedure is also applied to deal with the 
situation that the material represented by the element to be eroded no longer 
contributes to the physical response for the problem, such as the case where the 
material is melted down or evaporated at high temperature or is transformed 
into very small particles. 

2.3 Interaction Detection 

The interaction detection comprises three phases: (global) spatial search, 
(local) interaction resolution and interaction force computation. 

The spatial search algorithm employed is a combination of the standard 
space-cell sub-division scheme with a particular tree storage structure - the 
Augmented Digital Tree (ADT) [5] - and can accommodate various geometri- 
cal entities including points, facets, disks/spheres and ellipses. Each entity is 
represented by a bounding box extended with a buffer zone. The size of the 
buffer zone is a user-specified parameter. In the case of medium-range interac- 
tion, it must not be smaller than the influence distance of each object considered. 
The algorithm eventually locates for each object (termed the contactor) a list 
of neighbouring objects (termed potential targets) that may potentially interact 
with the contactor. 
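
The role of the buffered bounding boxes can be sketched in Python as follows. This is only an illustration under simplifying assumptions: the objects are disks/spheres given as (centre, radius) pairs, and a brute-force pairwise test is used in place of the space-cell/ADT search described above. All function names are hypothetical.

```python
def buffered_box(centre, radius, buffer):
    """Axis-aligned bounding box of a disk/sphere, enlarged by the buffer zone."""
    half = radius + buffer
    lo = tuple(c - half for c in centre)
    hi = tuple(c + half for c in centre)
    return lo, hi


def boxes_overlap(a, b):
    """True if two (lo, hi) boxes overlap in every coordinate direction."""
    (alo, ahi), (blo, bhi) = a, b
    return all(l1 <= h2 and l2 <= h1
               for l1, h1, l2, h2 in zip(alo, ahi, blo, bhi))


def potential_targets(objects, buffer):
    """Return {contactor index: [potential target indices]}.

    objects is a list of (centre, radius) pairs. Each pair of objects is
    stored once, with the lower index acting as the contactor (assumption).
    The O(n^2) loop stands in for the tree-based search of the paper.
    """
    boxes = [buffered_box(c, r, buffer) for c, r in objects]
    lists = {i: [] for i in range(len(objects))}
    for i in range(len(objects)):
        for j in range(i + 1, len(objects)):
            if boxes_overlap(boxes[i], boxes[j]):
                lists[i].append(j)
    return lists
```

With three unit disks centred at x = 0, 2.5 and 10, a buffer of 0.3 pairs the first two disks as potential targets even though they are not yet touching, whereas a zero buffer produces no pairs.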

In the second phase, each potential contactor-target pair is locally resolved 
on the basis of their kinematic relationship, and any pair for which interaction 
does not occur, including medium range interaction if applied, is excluded. In 
the final phase, the interaction forces between each actual interaction pair are 
determined according to a constitutive relationship or interaction law. 

Effects of Buffer Zone Sizes The global spatial search may not necessarily 
be performed at every time step, as will be described below, while the interac- 
tion resolution and interaction force computations should be conducted at each 
time step. Furthermore, some kinematic variables computed in the interaction 
resolution phase will be used in the force computation. For these reasons, the 
interaction resolution and force computations are actually performed together 
in the current implementation. 

The computational cost involved in the interaction force computation phase 
is fixed at each time step, if no new surfaces are created during the fracturing 
process. It may, however, vary significantly for the other two phases if different 
sizes of buffer zone are specified. 

Basically, the size of the buffer zone has conflicting effects on the overall costs of 
the global search and the local interaction resolution. First of all, after performing 
a global search at one time step, a new search is required to be performed 
only if the following condition is satisfied 

$$ l_{max} > l_{buff} \qquad (1) $$

where $l_{buff}$ is the size of the buffer zone; and $l_{max}$ is the maximum accumulated 
displacement, over all discrete bodies and in any axial direction, after 
the previous search step: 

$$ l_{max} = \max_{i \in [1, n_{body}],\; j \in [1, n_{dim}]} \left| \sum_{k=1}^{m} v_i^j \, \Delta t_k \right| \qquad (2) $$

in which $v_i^j$ is the velocity of object $i$ in the $j$ direction; $\Delta t_k$ the length of the 
time step at the $k$-th increment after the global search; $n_{body}$ the total number 
of objects and $n_{dim}$ the number of space dimensions. The upper summation limit, 
written here as $m$, is the number of increments performed since that search. 

Given a larger buffer zone, the spatial search will create a longer list of 
potential targets for each contactor, which will increase the amount of work 
in the phase of interaction resolution in order to filter out those pairs not in 
potential interaction. On the other hand, the global search will be performed 
with less frequency which leads to a reduced overall cost in the spatial search 
phase. With a smaller buffer zone, the potential target list is shorter and the 
local interaction resolution becomes less expensive at each step, but the global 
search should be conducted more frequently thus increasing the computational 
cost in searching. 

A carefully selected buffer zone can balance the cost in the two phases to 
achieve a better overall cost in interaction detection. Nevertheless, an optimal 
buffer zone size is normally difficult to select for each particular application. 
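
The search trigger of conditions (1) and (2) can be sketched in Python as below. This is a hedged illustration, not the authors' code: the class `SearchTrigger` and its methods are hypothetical names, and it is assumed that the velocities supplied at each increment are those used to move the objects over that increment.

```python
class SearchTrigger:
    """Tracks accumulated displacements since the last global search."""

    def __init__(self, n_body, n_dim, l_buff):
        self.l_buff = l_buff
        # One accumulator per object and per axial direction.
        self.acc = [[0.0] * n_dim for _ in range(n_body)]

    def advance(self, velocities, dt):
        """Accumulate v * dt for one increment.

        velocities[i][j] is the velocity of object i in direction j.
        Returns True if a new global search is required, i.e. if the largest
        accumulated displacement exceeds the buffer zone size.
        """
        for acc_i, v_i in zip(self.acc, velocities):
            for j, v in enumerate(v_i):
                acc_i[j] += v * dt
        l_max = max(abs(d) for acc_i in self.acc for d in acc_i)
        return l_max > self.l_buff

    def reset(self):
        """Call after a global search has been performed."""
        for acc_i in self.acc:
            for j in range(len(acc_i)):
                acc_i[j] = 0.0
```

A larger `l_buff` makes `advance` return True less often, which is the reduced search frequency discussed above, at the price of longer potential target lists.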



Incremental Global Search Generally speaking, the spatial search is an ex- 
pensive task that often consumes a substantial portion of the total simulation 
time if each new search is performed independently from the previous search. 
The cost could however be reduced to some extent if the subsequent searches 
after the initial one are conducted in an incremental manner. In this incremen- 
tal approach, the tree structure (ADT) required in the current search can be 
obtained as a modification of the previous structure and some search operations 
can also be avoided. 

This approach is based on the observations that even though the configura- 
tion may undergo significant changes during the whole course of the simulation, 
the actual change between two consecutive search steps is bounded by the buffer 
zone and therefore is local. In addition, the space bisection tree is characterised 
by the fact that each node in the tree represents a subspace of the whole sim- 
ulation domain. As long as each node (i.e. object) itself stays in the subspace 
it is associated with, the whole tree will not need to be modified at all. 
Consequently, in the new search, it is possible to avoid building an entirely new tree 
by modifying only the subtrees that are affected by those objects which have 
moved out of their represented subspace. 

The details of the above (incremental) global search algorithm can be found 
in [5]. A more detailed description of interaction detection can be found in [4], 
while the interaction laws applied to particle dynamics, and shot peening in 
particular, have been discussed in [6,7]. 



3 Parallel Implementation Strategies 

Domain decomposition based parallelisation has been established as one of the 
most efficient high level (coarse grain) approaches for scientific and engineering 
computations, and also offers a generic solution for both shared and distributed 
parallel computers. 

A highly efficient parallel implementation requires two, often competing, 
objectives to be achieved: a well balanced workload among the processors, and a 
low level of interprocessor communication overhead. 

For conventional finite element computations with infrequent adaptive remesh- 
ing and without contact phenomena, a static domain decomposition is generally 
an ideal solution that initially distributes the sub-domain data to each processor 
and redistributes them after each adaptive remeshing. A well load-balanced situ- 
ation can be achieved if each processor is assigned an equal number of elements. 
Interprocessor communications can be reduced to a low level if the number of 
interface nodes that are shared by more than one sub-domain is small. 

For the current situation involving both finite and discrete elements, to ap- 
ply a single static domain decomposition for both finite and discrete element 
domains will apparently not achieve a good parallel performance due to the 
highly dynamic evolution of the configuration concerned. Therefore a dynamic 
domain decomposition strategy should be adopted. 

3.1 Dynamic Domain Decomposition 

The primary goal of the dynamic domain decomposition is to dynamically assign 
a number of discrete elements or objects to each processor to ensure good load 
balance as the configuration evolves. 

Besides the same two objectives as for a static domain decomposition, the 
dynamic domain decomposition should also achieve an additional two objectives: 
minimum data movement and efficiency. 

Firstly, completely independent domain decompositions at consecutive 
steps normally give rise to a very large amount of data movement 
among the processors, leading to a substantial communication overhead. Therefore, 
the dynamic domain decomposition should provide efficient re-partitioning 
algorithms that keep the domain partitioning as constant as possible 
during the simulation. Secondly, since the partitioning may need to be performed 
many thousands of times during simulations, the dynamic domain decomposition 
must be very efficient. 

Most dynamic domain decomposers can be classified into two general cat- 
egories: geometric and topological. Geometric partitioners divide the computa- 
tional domain by exploiting the location of the objects in the simulation, while 
topological decomposers deal with the connectivities of interactions instead of 
geometric positions. Both methods can be applied to both finite element and 
discrete element simulations. Generally, topological methods can produce bet- 
ter partitions than geometric methods, but are more expensive. A particular 
topological based decomposer called (Par)METIS [8] is chosen as the domain 
partitioner in our implementation as it appears to meet the above criteria for 
an effective domain decomposition algorithm. 

When applying the dynamic domain decomposition strategy to the current 
finite/discrete element simulation, there are two different approaches. The first 
approach dynamically decomposes both finite and discrete element domains. 
However, the different characteristics of the two parts make it very difficult to achieve a 
good load balance. 

An alternative solution, which is employed in this work, is to decompose the 
computations associated with the two domains separately. Within the interac- 
tion detection of the discrete elements, slightly modified schemes are employed 
for the global search, and the combined interaction resolution and force compu- 
tations. This strategy provides maximum flexibility in the implementation and 
thus allows the most suitable methodologies and data structures to be applied 
in different stages of the simulation. This may however cause extra commu- 
nications in information exchange among processes due to the data structure 
inconsistencies between different solution stages at each time step. 

Dynamic decomposition of finite element computations and a particular par- 
allel implementation version of the global search are discussed below, while the 
parallel issues for the combined interaction resolution and force computation, 
including dynamic domain partitioning and dynamic load re-balancing, will be 
addressed respectively in the next two sections. 



3.2 Dynamic Decomposition of Finite Element Computations 

Compared to the discrete element domain, the configuration change of the finite 
element mesh is relatively infrequent and less significant. The major concern for 
a dynamic domain decomposition algorithm is its ability to adaptively partition 
the domain so as to minimize the cost due to the data migration among different 
processors after a new partitioning. 

Achieving a well-balanced workload often requires fine tuning. In 
the case that different elements have different orders (e.g. linear or quadratic), 
and/or different material models, and/or different stress states (e.g. elastic or 
plastic), thus having different computational effort, each element should be 
weighted proportional to its actual cost to ensure a good load balance. In 
heterogeneous computer clusters, an uneven workload decomposition should be 
accomplished to take into account the different computing powers that each member 
machine of the group can offer. 

Box 1: Parallel Implementation of Initial Global Search 

1. Globally distribute objects among processors using any available means of 
domain decomposition; 

2. Each processor defines the bounding box of each object with an extended 
buffer zone in its subdomain; 

3. Each processor computes the global bounding box of its domain and broadcasts 
this information to other processors; 

4. Each processor applies the sequential global search algorithm to its domain 
and constructs its own ADT and potential target lists; 

5. Each processor identifies in its domain the objects that overlap with the 
global bounding boxes of the other domains and then sends the objects to 
the processors owning the overlapping domains; 

6. Each processor receives from other processors the objects that overlap 
with its own global bounding box; conducts the search for each object in 
its ADT to build the potential target list, if any; and sends the result back 
to the processor to which the object originally belongs; 

7. Each processor collects the lists created in other processors and merges 
them to the original lists to obtain the final lists. 



3.3 Dynamic Parallelisation of Global Search 

Global search is found to be the most difficult operation to be parallelised effi- 
ciently due to its global, irregular and dynamic characteristics. In our parallel 
implementation, domain decomposition strategies are also applied to 
both initial and subsequent incremental search steps. Box 1 outlines the algo- 
rithmic steps involved in the initial global search. 

As the initial search is conducted only once, the performance of the algorithm 
is not a major concern. Therefore, it can employ any available means to distribute 
the objects among the processors. In some cases, the algorithm can even be 
performed sequentially. 
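
Steps 3 and 5 of Box 1 can be illustrated with the following Python sketch, which runs serially and only computes which objects each processor would have to send. It is a sketch under stated assumptions: boxes are (lo, hi) tuples that already include the buffer zone, the message passing itself is omitted, and the function names are hypothetical.

```python
def union_box(boxes):
    """Global bounding box enclosing a list of (lo, hi) boxes (step 3)."""
    n_dim = len(boxes[0][0])
    lo = tuple(min(b[0][d] for b in boxes) for d in range(n_dim))
    hi = tuple(max(b[1][d] for b in boxes) for d in range(n_dim))
    return lo, hi


def overlaps(a, b):
    """True if two (lo, hi) boxes overlap in every coordinate direction."""
    return all(a[0][d] <= b[1][d] and b[0][d] <= a[1][d]
               for d in range(len(a[0])))


def objects_to_send(local_boxes, global_boxes, my_rank):
    """Step 5: find local objects overlapping other domains' global boxes.

    local_boxes  : buffered boxes of the objects owned by this processor
    global_boxes : {rank: global bounding box}, as gathered in step 3
    Returns {destination rank: [local object indices]}.
    """
    out = {}
    for rank, gbox in global_boxes.items():
        if rank == my_rank:
            continue
        hits = [i for i, box in enumerate(local_boxes) if overlaps(box, gbox)]
        if hits:
            out[rank] = hits
    return out
```

Only objects near a subdomain boundary appear in the result, which is why a compact decomposition keeps the communication volume of the search low.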

For the subsequent incremental searches, an effective parallel implementation 
of the algorithm becomes critical. Box 2 presents the corresponding parallel 
algorithm [9] and only the differences from the previous initial search approach 
are listed. 

In the algorithm, it is essential to assume that a good dynamic object dis- 
tribution is performed in order to achieve a good load balance, and one such 
distribution scheme will be proposed in the next section. 








Box 2: Parallel Implementation of Incremental Global Search 



1. Each processor migrates those objects that are assigned to different do- 
mains in the current partition to their new processors; 

2. Each processor defines/modifies the bounding box of each object with an 
extended buffer zone in its subdomain; 

3. Each processor updates the global bounding box of its domain and broad- 
casts this information to other processors; 

4. Each processor modifies its own ADT and constructs its potential target 
lists; 

5-7. Same as the implementation in Box 1. 



4 Dynamic Domain Decomposition for Interaction 
Computations 

Many scientific and engineering applications can be abstractly expressed as 
weighted graphs. The vertices in the graph represent computational tasks and the 
edges represent data exchange. Depending on the amount of computation per- 
formed by each task, the vertices are assigned a proportional weight. Similarly, 
the edges are assigned weights that reflect the data needed to be exchanged. 

A domain partitioning algorithm aims to assign each processor a subset of 
vertices whose total weight is as equal as possible so as to balance the work 
among the processors. At the same time, the algorithm minimises the edge-cut 
(subject to load-balance requirements) to minimise the communication overhead. 
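
The two quantities that a partitioner trades off can be made concrete with a small Python sketch. It only evaluates a given assignment and does not compute a partition; the function name and the edge format (u, v, weight) are assumptions made for illustration.

```python
def partition_quality(vertex_weights, edges, part, n_parts):
    """Evaluate a partition of a weighted graph.

    vertex_weights : list of vertex weights (computational cost per task)
    edges          : list of (u, v, w) tuples, w being the data exchanged
    part           : part[v] is the processor assigned to vertex v
    Returns (per-processor loads, edge-cut).
    """
    loads = [0.0] * n_parts
    for v, w in enumerate(vertex_weights):
        loads[part[v]] += w
    # The edge-cut is the total weight of edges whose ends lie on different processors.
    edge_cut = sum(w for u, v, w in edges if part[u] != part[v])
    return loads, edge_cut
```

For a four-vertex chain with vertex weights 2, 1, 1, 2 and edge weights 1, 3, 1, the assignment [0, 0, 1, 1] gives perfectly balanced loads but cuts the heavy middle edge, whereas [0, 1, 1, 0] lowers the edge-cut at the expense of balance.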

As the simulation evolves, the computational work associated with an object 
can vary, so the objects may need to be redistributed among the processors to 
balance the workload. The objective of dynamic re-partitioning is therefore to 
efficiently compute a balanced partitioning that minimises the edge-cut, 
and to minimise the amount of data movement required in the new partitioning. 

4.1 A Priori Graph Representation of Discrete Objects 

The first important step for the development of an efficient dynamic partition- 
ing for the interaction computation lies in an appropriate graph representation 
of discrete elements. Unlike a finite element mesh, discrete objects do not have 
underlying connectivity to explicitly associate with, and thus some form of con- 
nections among the objects should be established. An obvious choice is to use 
the potential target list of each contactor as its connectivity, upon which a graph 
representation of the interaction computation can therefore be established. 

Each vertex is initially assigned a weight that is equal to the number of the 
targets in its potential list and no weight is placed on the edges. This weighting 
scheme will work reasonably well if the simulation is dominated by one type 
of interaction and the objects have a nearly even distribution across the whole 
domain. 

As the above graph model is established based on a priori estimation of 
the interaction relationship and the cost of interaction force computation, it is 
therefore termed the a priori model. 
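
A minimal Python sketch of building the a priori graph from the potential target lists is given below. The vertex weighting follows the rule stated above (number of potential targets, no edge weights). Making the adjacency symmetric is an assumption added here, since graph partitioners generally expect undirected graphs; the function name is hypothetical.

```python
def a_priori_graph(potential_targets):
    """Build (vertex weights, adjacency) from potential target lists.

    potential_targets : {contactor: [potential targets]}
    Vertex weight = number of targets in the contactor's own list.
    Edges carry no weight; adjacency is stored symmetrically (assumption).
    """
    vertices = set(potential_targets)
    for targets in potential_targets.values():
        vertices.update(targets)
    adjacency = {v: set() for v in vertices}
    for c, targets in potential_targets.items():
        for t in targets:
            if t != c:
                adjacency[c].add(t)
                adjacency[t].add(c)
    weights = {v: len(potential_targets.get(v, [])) for v in vertices}
    return weights, adjacency
```

Because the lists only change when a global search is performed, this graph needs to be rebuilt only after each search.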

This special choice of connectivity has the following implications: First, as 
the potential target list will not be changed between two consecutive global searches, 
the graph partitioning may need to be performed only once after each global 
search rather than at each time step. Second, depending on the size of buffer 
zones specified, the list is only an approximation to the actual interaction rela- 
tionship between the contactor and target. Furthermore, the actual relationship 
between objects is also undergoing dynamic changes at each time step. These 
considerations indicate that the resulting graph may not be a perfect represen- 
tation of the problem under consideration. Consequently some degree of load 
imbalance may be expected. 

4.2 A Posteriori Graph Representation 

In order to improve the accuracy of the above graph model, the following inherent 
features in the current solution procedure for the interaction resolution and force 
computation should be addressed: 

— The target lists created by the global search provide only a potential interac- 
tion relationship between contactors and targets, and this relation can only 
be established after the local resolution, and therefore can not be accurately 
established a priori; 

— The computation costs associated with each object, including the kinematic 
resolution and interaction force computation, are also unknown a priori, and 
are also very difficult to measure precisely; 

— Both the interaction relationship and the computational cost for each object 
are undergoing constant changes at each time step. 

By specifying a very small buffer zone, the potential interaction lists can 
be much closer to the actual interaction relationship, but this will significantly 
increase the costs of simulation as indicated earlier, and therefore is not a suitable 
option. 

In view of these facts, an alternative graph representation model is suggested. 
The basic idea is to use the actual information obtained in the previous time 
step as the base for modelling the situation at the current step. More specifically, 
a (nearly) accurate graph model is built for the previous time step. The graph 
is based on the actual contactor-target list as the connectivity and the (nearly) 
actual computational cost for each contactor as the vertex weighting. Since the 
computation domain will not undergo a significant change in two consecutive 
time steps as a result of a small time increment, it is reasonable to assume that 
this graph model is also a good representation of the problem at the current time 
step. Due to the a posteriori nature, this model is thus termed the a posteriori 
graph representation. 

Another advantage of this a posteriori model is that the global search and 
the domain partitioning can now be performed at different time instances and 
intervals. This means that the global search is performed when the potential 
interaction list is no longer valid, while a dynamic graph partitioning can 
be conducted when the load balance needs to be restored. The latter aspect 
will be further expanded upon in the next section. 

4.3 Integration with Global Search 

In principle, both global search and graph partitioning can have totally inde- 
pendent data structures, mainly with different object distribution patterns. In 
distributed memory platforms, if the same object is assigned to different do- 
mains/processors in the search and partitioning phases, some data movement 
between two processors becomes inevitable. The extra communication results 
from the data structure inconsistency between the two phases. If the two data 
structures can be integrated to a certain degree, the communication overhead 
can be reduced. 

In the parallel implementation, the data associated with a particular object, 
such as coordinates, velocities, material properties, forces and history-dependent 
variables, is more likely to reside on the processor determined by the graph 
partitioning. Therefore, it is a good option for the global search to use the same 
object distribution pattern as generated in the graph partitioning phase, i.e. a 
single dynamic domain decomposition is applied to all computation phases in 
the interaction detection step. 

Another advantage of this strategy is that a good load balance may also be 
achieved in the global search. This can be justified by the fact that a similar total 
number of potential interaction lists among the processors also implies a similar 
number of search operations. In addition, the incremental nature of the dynamic 
repartitioning employed ensures that only a small number of objects is migrated 
from one domain to another, and hence only a limited amount of information is 
required to be exchanged among the processors when re-constructing the ADT 
subtrees and potential target lists in the incremental global search. 

5 Dynamic Load Re-balancing 

The domain partitioning algorithm is supposed to generate a well load-balanced 
partitioning. For various reasons, this goal may, however, be far from easy to 
achieve, especially when dynamic domain decomposition is involved. It is essen- 
tial, therefore, to have a mechanism in the implementation that can detect the 
problem of load imbalance when it occurs and take proper actions to restore, 
or partially restore, the balance if necessary. As a matter of fact, dynamic load 
re-balancing is an integral part of the dynamic domain decomposition which 
determines the time instances at which a dynamic re-partition should be per- 
formed. In this section, a dynamic load balancing scheme will be proposed in 
an attempt to enhance the performance of the parallelised interaction resolution 
and force computations. 

5.1 Sources of Load Imbalance 

Load imbalance may be caused by the following factors: 

— A perfect partition may not be able to achieve a perfect CPU time balance 
due to sophisticated hardware and software issues; 

— The partition produced by the domain decomposer is not perfectly balanced 
in terms of workload; 

— In the finite element computation phases, imbalanced workload may arise 
when mechanical properties of some elements change, for instance, from elastic 
to plastic; or when local re-meshing leads to more or fewer elements for 
certain domains; 

— In the interaction resolution and force computation phases, workload imbalance 
may happen for two reasons: the connectivity for each object used 
in the graph is only an approximation to the real interaction relationship as 
addressed earlier; and/or the assigned weighting for each object/edge may 
not represent the real cost associated with the object/edge. 

The first two sources of load imbalance are beyond the scope of the present 
work and thus will not be given further consideration. 

An important part of the load rebalancing scheme is to obtain a fairly accurate 
measurement of the actual cost for each object. This is, however, not easy to 
fulfil. The difficulty is due to the fact that the relative differences in computation 
costs between different types of interaction, such as sphere to sphere versus sphere 
to facet, or sphere to sphere versus node to facet, are difficult to define accurately. 

A possible solution to this difficulty is by means of numerical experiment. We 
can design a series of experiments on one particular system to establish a relative 
cost of each element operation, including each type of interaction resolution and 
force computation. This relative cost, however, should be updated accordingly 
if the program is to be run on a new system. 

With the relative cost model, we can compute the number of different com- 
putation operations an element/object participates in at each time step and then 
calculate its actual cost upon which the assignment of a proportional weighting 
to the element/object in the graph is based. This provides a basis on which 
further operations aimed at maintaining load balancing can take place. 
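
The weighting step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the operation names and the relative cost values in `RELATIVE_COST` are hypothetical placeholders for figures that would come from timing experiments on a particular machine, and `object_weights` is not a routine from the program discussed here.

```python
from collections import Counter

# Hypothetical relative costs (illustrative values only); in practice they
# would be calibrated by numerical experiment on the target system.
RELATIVE_COST = {
    "sphere_sphere": 1.0,
    "sphere_facet": 2.5,
    "node_facet": 3.0,
}

def object_weights(operations, relative_cost=RELATIVE_COST):
    """Return {object_id: weight} for one time step.

    operations is an iterable of (object_id, operation_type) records, one
    per operation the object took part in during the step.
    """
    counts = Counter(operations)  # (object_id, op_type) -> occurrences
    weights = {}
    for (obj, op), n in counts.items():
        weights[obj] = weights.get(obj, 0.0) + n * relative_cost[op]
    return weights

# Object 0 takes part in two sphere-sphere and one sphere-facet operation.
ops = [(0, "sphere_sphere"), (0, "sphere_sphere"), (0, "sphere_facet"),
       (1, "sphere_sphere"), (2, "node_facet")]
weights = object_weights(ops)
```

The resulting weights are what would be attached to the graph vertices before (re-)partitioning; recalibrating `RELATIVE_COST` is the only change needed when moving to a new system.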



5.2 Imbalance Detection and Re-balancing 

Most load re-balancing schemes consist of three steps: imbalance detection, re- 
balancing criterion and domain re-decomposition. 

The first step of load re-balancing schemes is to constantly monitor the level 
of load imbalance among the processors during the simulation. The workload of 
one processor at each time step can be measured, for instance, by summing up 
the relative costs of all objects in the processor. These costs are already computed 
during the interaction computation by following the procedure outlined earlier. 
Alternatively, the workload can also be accurately obtained by measuring its actual runtime, but this is not trivial. Using either approach, the level of load imbalance can be defined as 



$$ B = \frac{W_{\max} - \bar{W}}{\bar{W}} \tag{3} $$

where

$$ W_{\max} = \max_i W_i, \qquad \bar{W} = \frac{1}{n_p} \sum_{i=1}^{n_p} W_i, $$

$W_i$ is the workload of the i-th processor, and $n_p$ is the number of processors. 

The next step is to decide when to perform a new graph partitioning to re- 
balance the load. Such a decision requires consideration of complicated tradeoffs 
between the cost of the graph partitioning, the quality of the new partitioning 
and the redistribution cost. A simple approach is to start the re-partitioning when the level of imbalance, $B$, exceeds a prescribed value, $\tau$, i.e. 

$$ B > \tau \tag{4} $$

Similar to the size of the buffer zone, $\tau$ may also need to be carefully tuned in order to achieve a better overall performance. 

The final step is to perform a domain re-partitioning as described in the 
previous sections when condition (4) is satisfied. 
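
A minimal sketch of the detection step and of the re-balancing criterion of Eqs. (3) and (4) is given below. The function names and the sample workloads are illustrative assumptions; the threshold value would have to be tuned for a given application, as noted above.

```python
def imbalance_level(workloads):
    """Level of load imbalance B = (W_max - W_mean) / W_mean, Eq. (3)."""
    w_mean = sum(workloads) / len(workloads)
    return (max(workloads) - w_mean) / w_mean

def needs_repartition(workloads, threshold):
    """Re-balancing criterion of Eq. (4): B exceeds the prescribed value."""
    return imbalance_level(workloads) > threshold

# Illustrative per-processor workloads (summed relative costs or run times).
loads = [120.0, 100.0, 90.0, 90.0]
B = imbalance_level(loads)  # mean workload is 100, maximum is 120
```

With these sample values a threshold of 0.1 triggers a re-partitioning while a threshold of 0.25 does not, which illustrates the trade-off between tolerating imbalance and paying the re-partitioning and redistribution costs.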



6 Numerical Experiments 

In this section, three examples are provided to illustrate the performance of 
the algorithms and implementation suggested. As the parallel performance of a 
domain decomposition for conventional finite elements is well established, the 
experiments will focus on the interaction resolution and force computations in 
both 2D and 3D cases. 

In addition, in view of the fact that the efficiency of a parallelised program 
is often affected to some extent by complex hardware and software issues, the 
contribution to the overall performance from only the algorithmic perspective is 
therefore identified. More specifically, the following issues will be investigated: 
1) the cost of the dynamic domain partitioning and repartitioning in terms of 
CPU time, and the quality of the partitioning; 2) the behaviour of two proposed 
graph representation models in terms of load balancing. 

The parallelised finite/discrete element analysis program is tested on an SGI Origin 2000 with 8 processors. Due to the shared memory feature of the machine, interprocessor communication overhead plays a much less significant role in the parallel performance. Each example is tested with 1, 2, 4 and 6 processors. 

The type of interaction considered between discrete elements is standard 
mechanical contact and the contact cases include node to edge, disk to disk and 
disk to edge contact in 2D, and sphere to sphere and sphere to 3-noded facet 
contact in 3D. 

6.1 Example 1: 2D Dragline Bucket Filling 

This example simulates a three-stage dragline bucket filling process, where the 
rock is modelled by discrete elements as disks, while the bucket is modelled 
as rigid, and the filling is simulated by dragging the bucket with a prescribed 
motion. 

Figs. 1 and 2 respectively illustrate the initial configuration of the problem and two subsequent stages of the simulation. The discrete element model contains over 5000 disks with a radius of 20 units. The size of the buffer zone is chosen to be 1 unit, and the time step is set to $2.6 \times 10^{-6}$ s, which leads to a total of 1.6 million time increments required to complete the simulation. With the current buffer zone size, the global (re-)search is performed about every 30 to 50 steps and takes about 12.9% of the total CPU time in the sequential case. 

For the dynamic domain partitioner employed, the CPU time consumed is 0.7% of the total time in the sequential case and 1.2% with 6 processors. This indicates that the re-partitioning algorithm in ParMETIS is efficient. It is also found that the partitioner produces partitionings with up to 3.5% load imbalance in terms of the weighted sub-total during the simulation. 

Fig. 3 demonstrates the necessity of employing a dynamic re-partitioning 
scheme, in which, the sub-domain distributions obtained by the re-partitioning at 
two stages are compared with those produced by a completely new partitioning at 
each occasion. It confirms that a series of consistent partitions that minimise the 
redistribution cost will not be achieved unless a proper re-partitioning algorithm 
is adopted in the dynamic domain decomposition. Note that, for better viewing, 
a much larger disk radius is used in Fig. 3. 

The CPU time of each processor for the contact resolution and force computations using the a priori graph model over the first 500k time increments is shown in Fig. 4a for various numbers of processors, where it is clearly illustrated that severe load imbalance occurs. This can be explained, for instance in the 2-processor case, by the fact that although the domain is well partitioned according to the potential contact lists, the actual computation cost on the second sub-domain is much less because the disks in this region are more scattered, i.e. more false contact pairs are included in the corresponding lists. 

A much better load balancing situation, depicted in Fig. 4b, has been achieved by the a posteriori graph model together with the proposed load re-balancing scheme. Table 1 also presents the overall speedup obtained by these two models with different numbers of processors. 



Table 1. Speedup obtained by two graph models for Example 1 

Model                 2 processors   4 processors   6 processors
a priori model        1.63           3.20           4.41
a posteriori model    1.86           3.55           5.01
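
As a short worked example, the speedups of Table 1 can be converted into parallel efficiencies (speedup divided by the number of processors). The snippet only post-processes the tabulated values; the variable names are illustrative.

```python
# Speedups copied from Table 1 (Example 1), keyed by number of processors.
speedup = {
    "a priori":     {2: 1.63, 4: 3.20, 6: 4.41},
    "a posteriori": {2: 1.86, 4: 3.55, 6: 5.01},
}

# Parallel efficiency = speedup / number of processors.
efficiency = {
    model: {p: s / p for p, s in row.items()}
    for model, row in speedup.items()
}

for model, row in efficiency.items():
    print(model, {p: round(e, 3) for p, e in row.items()})
```

On 6 processors the efficiency rises from roughly 0.74 for the a priori model to roughly 0.84 for the a posteriori model.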





498 D.R.J. Owen et al. 




Fig. 1. Example 1 - Dragline bucket filling: Initial configuration 




Fig. 2. Example 1 - Dragline bucket filling: Configurations at two stages showing particle velocities: (a) t=2s; (b) t=3.3s 

Fig. 3. Example 1 - Dragline bucket filling: Domain partitions at two different time instants: complete partitioning (left column) and re-partitioning (right column) 



Fig. 4. Example 1 - Dragline bucket filling: CPU time of each processor: (a) the a priori model; (b) the a posteriori model 

Table 2. Speedup obtained by two graph models for Example 2 

Model                 2 processors   4 processors   6 processors
a priori model        1.86           3.55           4.72
a posteriori model    1.88           3.65           5.23



6.2 Example 2: 3D Hopper Filling 

The second example simulates a 3D ore hopper filling process. The ore particles are represented by discrete elements as spheres and the hopper and the wall are assumed to be rigid. The particles are initially regularly packed at the top of the hopper and are then allowed to fall under the action of gravity. The configuration of the problem at an intermediate stage of the simulation is illustrated in Fig. 5. 




Fig. 5. Example 2 - 3D Hopper filling: Configuration at t=0.75s 



The radius of the spheres is 0.25 units and the buffer zone is set to be 0.0075 units. The global search is conducted less frequently at the beginning and end of the simulation due to a relatively small velocity of motion. 

A similar quality of the partitioning as in the previous example is observed in this example. Table 2 shows the speedup achieved by both graph models with various numbers of processors. It appears that the a priori graph model exhibits a similar performance to the a posteriori model for the cases of 2 and 4 processors, but shows a performance degradation in the 6-processor case. The reason is that, because of the symmetry in both x- and y-directions in the problem, the domain decomposer produces a well balanced partitioning, in terms of actual computation cost, in the 2 and 4 processor cases, but fails to achieve the same quality of partitioning in the case of 6 processors. Also note that since the computational cost associated with each contact pair in 3D is higher than that in 2D, a slightly better overall parallel performance is achieved in this example. 

6.3 Example 3: Axisymmetric Layered Ceramic Target Impact 

This example consists of a tungsten long rod impacting a composite target comprising an RHA backing block, three ceramic tiles of approximate thickness 25 mm, and a 6 mm RHA cover plate. The ceramic tiles are unconfined. The computational model is axisymmetric with the tile radius set as 74 mm. Four-noded quadrilateral elements are used to represent the metal components and three-noded triangular elements are used for the ceramic. Each component is initially set up as an individual discrete body with contact conditions between the bodies modelled using Coulomb friction. The centreline is modelled using a contact surface as a shield to prevent any object crossing the symmetry axis. The initial finite element discretisation is shown in Fig. 6. 

The tungsten is modelled using an Armstrong-Zerilli AZUBCC model whilst the RHA is modelled using an AZMBCC model. Topological changes via erosion due to plastic strain are employed for both materials. The ceramic is treated as a brittle fracturing material and is modelled using the rotating crack model. 

Two impact velocities, 1325 m/s and 1800 m/s, are considered. 

The development of damage in the ceramic with increasing penetration at different stages is shown in Figs. 7a-7d. The configuration at t = 200 μs is also depicted in Fig. 8. 

Similar to the previous examples, the a posteriori graph model for discrete 
objects achieves better performance, which is demonstrated in Table 3. 



Table 3. Speedup obtained by two graph models for Example 3 

Model                 2 processors   4 processors   6 processors
a priori model        1.82           3.52           4.60
a posteriori model    1.86           3.63           5.33



7 Concluding Remarks 

The algorithmic aspects of a parallel implementation strategy for a combined finite and discrete element approach are presented in this work. The main features of the implementation include: 1) a dynamic domain decomposition is applied independently to the finite element computation, the contact detection and discrete element computation; 2) different methodologies can be employed in the global search and the interaction computations; 3) a dynamic graph re-partitioning is used for the successive decomposition of the moving configuration; 4) two graph models are proposed for the representation of the relationship between the discrete objects; 5) load imbalance can be monitored and re-balanced by the proposed scheme. 




Fig. 6. Example 3 - Ceramic target impact: (a) initial mesh; (b) zoomed mesh. 

Fig. 7. Example 3 - Ceramic target impact: Progressive damage indicating regions with radial fractures 

Fig. 8. Example 3 - Ceramic target impact: The configuration at t = 200 μs 

By means of numerical experiment, the performance of the proposed algorithms is assessed. It is demonstrated that the dynamic domain decomposition using the second graph model with dynamic re-partitioning, load imbalance detection and re-balancing scheme can achieve a high performance in applications involving both finite and discrete elements. It is worth mentioning that the strategy suggested can also be applied to other areas such as smooth particle hydrodynamics (SPH), meshless methods, and molecular dynamics. 

Abstract. Parallel edge-based data structures are used to improve the computational efficiency of Inexact Newton methods for solving finite element nonlinear solid mechanics problems on unstructured meshes composed of tetrahedra or hexahedra. We found that for tetrahedral meshes, the use of edge-based data structures reduces the memory required to hold the stiffness matrix by a factor of 7, and the number of floating point operations to compute the matrix-vector product needed in the iterative driver of the Inexact Newton method by a factor of 5. For hexahedral meshes the reduction factors are respectively 2 and 3. 



1. Introduction 

Predicting the three-dimensional response of large-scale solid mechanics problems 
undergoing plastic deformations is of fundamental importance in several science and 
engineering applications. Particularly in the Oil and Gas Industry, solid mechanics is 
being used to improve the understanding of complex geologic problems, thus helping 
to reduce risks and operational costs in exploration and production activities 
(Arguello, 1998). 

Traditional finite element technology for nonlinear quasi-static problems involves 
the repeated solution of systems of sparse linear equations by a direct solution 
method, that is, some variant of Gauss elimination. The updating and factorization of 
the sparse global stiffness matrix can result in extremely large storage requirements 
and a very large number of floating point operations. 

Explicit quasi-static nonlinear finite element technologies (Biffle, 1993), on the other hand, may be employed, considerably reducing memory requirements. Although robust and straightforward to implement, explicit schemes based on dynamic relaxation or nonlinear conjugate gradients may suffer from low convergence rates. 

In this paper we employ an Inexact Newton method (Kelley, 1995) to solve large-scale three-dimensional incremental elastic-plastic finite element problems found in geologic applications. In the Inexact Newton method, at each nonlinear iteration, a linear system of finite element equations is approximately solved by the preconditioned conjugate gradient method. The computational kernels of the Inexact Newton methods, besides residual evaluations and stiffness matrix updates, are the same as those of the iterative driver, that is, matrix-vector products and preconditioning. Matrix-vector products can be optimized using edge-based data structures, typical of computational fluid dynamics applications (Peraire, 1992, Luo, 1994). For unstructured grids composed of tetrahedra we found that the edge-based data structures reduce the memory required to hold the stiffness matrix by a factor of 7. Further, the number of floating point operations to compute the matrix-vector product is also reduced by a factor of 5. For grids composed of trilinear hexahedra memory is reduced by a factor of 2, while the number of floating point operations decreases by a factor of 3. 

J. M.L.M. Palma et al. (Eds.): VECPAR 2000, LNCS 1981, pp. 506-518, 2001. 

The remainder of this work is organized as follows. In the next section we briefly 
review the governing nonlinear finite element equations and the Inexact Newton 
methods. Section 3 describes the edge-based data structures for solid mechanics. 
Section 4 shows the numerical examples. The paper ends with a summary of the main 
conclusions. 



2. Incremental Equilibrium Equations and the Inexact Newton 
Method 



The governing equation for the quasi-static deformation of a body occupying a volume $\Omega$ is 

$$ \frac{\partial \sigma_{ij}}{\partial x_j} + \rho b_i = 0 \quad \text{in } \Omega \tag{1} $$

where $\sigma_{ij}$ is the Cauchy stress tensor, $x_j$ is the position vector, $\rho$ is the weight per unit volume and $b_i$ is a specified body force vector. Equation (1) is subjected to the kinematic and traction boundary conditions 



$$ u_i(\mathbf{x},t) = \bar{u}_i(\mathbf{x},t) \ \text{on } \Gamma_u, \qquad \sigma_{ij} n_j = h_i(\mathbf{x},t) \ \text{on } \Gamma_h \tag{2} $$

where $\Gamma_u$ represents the portion of the boundary where displacements are prescribed ($\bar{u}_i$) and $\Gamma_h$ represents the portion of the boundary on which tractions are specified ($h_i$). The boundary of the body is given by $\Gamma = \Gamma_u \cup \Gamma_h$, and $t$ represents a pseudo-time (or increment). Discretizing the above equations by a displacement-based finite element method, we arrive at the discrete equilibrium equation 



$$ \mathbf{F}_{int} = \mathbf{F}_{ext} \tag{3} $$

where $\mathbf{F}_{int}$ is the internal force vector and $\mathbf{F}_{ext}$ is the external force vector, accounting for applied forces and boundary conditions. Assuming that external forces are applied incrementally and restricting ourselves to material nonlinearities only, we arrive, after a standard linearization procedure, at the nonlinear finite element system of equations to be solved at each load increment 



$$ \mathbf{K}_T \, \Delta\mathbf{u} = \mathbf{R} \tag{4} $$

where $\mathbf{K}_T$ is the tangent stiffness matrix, a function of the current displacements, $\Delta\mathbf{u}$ is the vector of displacement increments and $\mathbf{R}$ is the unbalanced residual vector, that is, the difference between internal and external forces. 



Remark. We consider here perfectly plastic materials described by the Mohr-Coulomb yield criterion. Stress updating is performed by an explicit, Euler-forward subincremental technique (Crisfield, 1990). 

Some form of Newton method is generally used to solve the nonlinear finite element system of equations given by Eq. (4), where the tangent stiffness matrix has to be updated and factorized at every nonlinear iteration. This approach is known as the Tangent Stiffness (TS) method. The burden of repeated stiffness matrix updates and factorizations can be alleviated, at the expense of more iterations, either by keeping the tangent stiffness matrix frozen within a load increment or by iterating with the elastic stiffness matrix, which is known as the Initial Stress (IS) method. For solving large-scale problems, particularly in 3D, it is more efficient to solve the linearized problems approximately by suitable inner iterative methods, such as preconditioned conjugate gradients. This inner-outer scheme is known as the Inexact Newton method, and the convergence properties of its variants, the Inexact Initial Stress (IIS) and Inexact Tangent Stiffness (ITS) methods, have been analyzed for von Mises materials by Blaheta and Axelsson (1997). We introduce here a further enhancement in the IIS and ITS methods, by adaptively choosing the tolerance for the inner iterative equation solver according to the algorithm suggested by Kelley (1995). We also include in our nonlinear solution scheme a backtracking strategy to increase the robustness of the overall nonlinear solution algorithm. 
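
The inner-outer structure described above can be illustrated with a small self-contained sketch. It is not the solver used in this work: the model problem (a tridiagonal matrix plus a cubic term standing in for the material nonlinearity), the Jacobi preconditioner, the residual-ratio rule used for the inner tolerance (a simplified form, without the usual safeguards, of the kind of rule described by Kelley (1995)) and all parameter values are assumptions chosen only to make the example runnable.

```python
import numpy as np

def pcg(matvec, rhs, diag, rtol, max_iter=500):
    """Jacobi-preconditioned CG; stops when ||r|| <= rtol * ||rhs||."""
    x = np.zeros_like(rhs)
    r = rhs.copy()
    z = r / diag
    p = z.copy()
    rz = r @ z
    target = rtol * np.linalg.norm(rhs)
    for _ in range(max_iter):
        if np.linalg.norm(r) <= target:
            break
        Kp = matvec(p)
        alpha = rz / (p @ Kp)
        x += alpha * p
        r -= alpha * Kp
        z = r / diag
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

def inexact_newton(residual, tangent, u0, tol=1e-10, eta_max=0.1,
                   eta_min=1e-8, gamma=0.9, max_outer=50):
    """Outer Newton loop solving K_T du = R inexactly, with backtracking."""
    u = u0.copy()
    R = residual(u)
    norm_R = norm0 = np.linalg.norm(R)
    eta, outer = eta_max, 0
    while norm_R > tol * norm0 and outer < max_outer:
        K = tangent(u)
        du = pcg(lambda v: K @ v, R, np.diag(K), eta)  # inner tolerance eta
        lam = 1.0
        R_new = residual(u + du)
        # Backtracking: halve the step until the residual norm decreases.
        while (np.linalg.norm(R_new) >= (1.0 - 1e-4 * lam) * norm_R
               and lam > 1e-3):
            lam *= 0.5
            R_new = residual(u + lam * du)
        u = u + lam * du
        norm_new = np.linalg.norm(R_new)
        # Adaptive inner tolerance from the ratio of successive residuals.
        eta = min(eta_max, max(eta_min, gamma * (norm_new / norm_R) ** 2))
        R, norm_R, outer = R_new, norm_new, outer + 1
    return u, outer

# Hypothetical model problem: f_int(u) = A u + u^3, constant external load.
n = 20
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
f_ext = np.ones(n)
residual = lambda u: f_ext - (A @ u + u ** 3)   # unbalanced residual R
tangent = lambda u: A + np.diag(3.0 * u ** 2)   # tangent matrix K_T

u, outer = inexact_newton(residual, tangent, np.zeros(n))
```

The design point the sketch tries to convey is that the inner tolerance is loose while the outer residual is still large and tightens only as the nonlinear iteration converges, so that few conjugate gradient iterations are spent on early, inaccurate linearizations.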



3. Edge-Based Data Structures 

Edge-based finite element data structures have been introduced for explicit computations of compressible flow on unstructured grids composed of triangles and tetrahedra (Peraire, 1992, Luo, 1994). It was observed in these works that residual computations with edge-based data structures were faster and required less memory than standard element-based residual evaluations. We have studied edge-based data structures for the implicit finite element solution of potential flow problems (Martins et al, 1997). Following these developments, for solid mechanics problems, we may derive an edge-based finite element scheme by noting that the element matrices can be disassembled into their edge contributions as 



$$ \mathbf{K}^e = \sum_{s=1}^{m} \mathbf{K}^e_s \tag{5} $$

where $\mathbf{K}^e_s$ is the contribution of edge $s$ to $\mathbf{K}^e$ and $m$ is the number of element edges, which is 6 for tetrahedra or 28 for hexahedra. 

Denoting by $\mathcal{E}$ the set of all elements sharing a given edge $s$, we may add their contributions, arriving at the edge matrix 

$$ \mathbf{K}_s = \sum_{e \in \mathcal{E}} \mathbf{K}^e_s \tag{6} $$



The resulting matrix is symmetric, and we need to store only the upper off-diagonal 3x3 block per edge. The edge-by-edge matrix-vector product may be written as 

$$ \mathbf{K}\mathbf{p} = \sum_{s=1}^{nedges} \mathbf{K}_s \mathbf{p}_s \tag{7} $$

where nedges is the total number of edges in the mesh and $\mathbf{p}_s$ is the restriction of $\mathbf{p}$ to the edge degrees-of-freedom. In Table 1 we compare the storage requirements to hold the coefficients of the element stiffness matrices and the edge stiffness matrices as well as the flop count and indirect addressing (i/a) operations for computing matrix-vector products using element and edge-based data structures for tetrahedral meshes. All data in these tables are referred to nnodes, the number of nodes in the finite element mesh. According to Lohner (1994), the following estimates are valid for unstructured 3D grids: nel ≈ 5.5 × nnodes, nedges ≈ 7 × nnodes. 
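
A minimal sketch of an edge-by-edge matrix-vector product in the sense of Eq. (7) is shown below. The storage layout is an assumption made for illustration: one 3x3 upper off-diagonal block per edge, as stated in the text, plus the 3x3 diagonal blocks accumulated per node in a separate array. The function and array names are hypothetical.

```python
import numpy as np

def edge_matvec(edges, offdiag, diag, p):
    """y = K p for a symmetric matrix stored by nodal 3x3 blocks.

    edges   : (nedges, 2) integer array of node pairs (i, j) with i < j
    offdiag : (nedges, 3, 3) upper off-diagonal blocks K_ij, one per edge
    diag    : (nnodes, 3, 3) diagonal blocks K_ii (assumed nodal storage)
    p       : (nnodes, 3) nodal vector
    """
    y = np.einsum("nab,nb->na", diag, p)   # diagonal block contributions
    for s, (i, j) in enumerate(edges):
        y[i] += offdiag[s] @ p[j]          # K_ij p_j
        y[j] += offdiag[s].T @ p[i]        # K_ji p_i, using symmetry
    return y

# Small check against a dense matrix assembled from the same blocks.
rng = np.random.default_rng(0)
nnodes = 4
edges = np.array([(i, j) for i in range(nnodes) for j in range(i + 1, nnodes)])
offdiag = rng.standard_normal((len(edges), 3, 3))
diag = rng.standard_normal((nnodes, 3, 3))
diag = diag + diag.transpose(0, 2, 1)      # symmetric diagonal blocks
p = rng.standard_normal((nnodes, 3))

K = np.zeros((3 * nnodes, 3 * nnodes))
for n in range(nnodes):
    K[3 * n:3 * n + 3, 3 * n:3 * n + 3] = diag[n]
for s, (i, j) in enumerate(edges):
    K[3 * i:3 * i + 3, 3 * j:3 * j + 3] = offdiag[s]
    K[3 * j:3 * j + 3, 3 * i:3 * i + 3] = offdiag[s].T

y_edge = edge_matvec(edges, offdiag, diag, p)
y_ref = (K @ p.ravel()).reshape(nnodes, 3)
```

The loop body shows where the indirect addressing arises: for every edge, the values of `p` and `y` at both end nodes are gathered through the `edges` array and the result is scattered back.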



Table 1. Memory to hold the stiffness matrix coefficients and computational costs for element and edge-based matrix-vector products for tetrahedral finite element meshes 

Data Structure   Memory          flop              i/a
EBE              429 × nnodes    1,386 × nnodes    198 × nnodes
Edges            63 × nnodes     252 × nnodes      126 × nnodes
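
The memory entries of Table 1 can be checked with a short calculation. The per-element count below is an assumption on our part: a 4-node tetrahedron with 3 degrees of freedom per node has a symmetric 12x12 element matrix, of which the upper triangle (78 coefficients) is stored. The mesh ratios of 5.5 elements and 7 edges per node are the estimates consistent with the tabulated values.

```python
nel_per_node = 5.5      # tetrahedra per node (mesh estimate)
nedges_per_node = 7.0   # edges per node (mesh estimate)

# Assumption: upper triangle of a symmetric 12x12 element matrix.
coeffs_per_element = 12 * 13 // 2   # 78 coefficients
coeffs_per_edge = 3 * 3             # one off-diagonal 3x3 block per edge

mem_ebe = coeffs_per_element * nel_per_node    # per node, element-by-element
mem_edge = coeffs_per_edge * nedges_per_node   # per node, edge-based

memory_ratio = mem_ebe / mem_edge
flop_ratio = 1386 / 252                        # flop entries of Table 1

print(mem_ebe, mem_edge, round(memory_ratio, 1), flop_ratio)
```

The ratios of about 6.8 for memory and 5.5 for floating point operations correspond to the rounded factors of 7 and 5 quoted for tetrahedral meshes.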



For meshes composed of 8-noded hexahedra we performed a study to assess the asymptotic ratio between the number of edges and the number of elements. Figure 1 shows, for an increasing number of divisions along each direction of a cube, the computed ratio between the number of resulting edges and the number of hexahedral finite elements in the meshes. We may note that the curve tends to an asymptotic ratio of 13. 




Fig 1. Edges/elements ratio on a cube. 
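
The kind of study summarized in Fig. 1 can be reproduced with a few lines of code. The sketch assumes, consistently with the 28 edges per hexahedron quoted earlier, that every pair of nodes of an element counts as an edge, and that the cube is divided into n x n x n equal hexahedra; the function name is illustrative.

```python
from itertools import combinations, product

def edges_per_element(n):
    """Unique edges divided by elements for an n x n x n grid of hexahedra.

    Every pair of the 8 nodes of an element is counted as an edge
    (28 per element); edges shared by neighbouring elements count once.
    """
    corners = list(product((0, 1), repeat=3))
    edges = set()
    for cell in product(range(n), repeat=3):
        nodes = [tuple(c + d for c, d in zip(cell, corner))
                 for corner in corners]
        for a, b in combinations(nodes, 2):
            edges.add((a, b) if a < b else (b, a))
    return len(edges) / n ** 3

ratios = {n: edges_per_element(n) for n in (1, 2, 4, 8)}
for n, r in ratios.items():
    print(n, r)
```

Counting axis-aligned edges, face diagonals and space diagonals separately gives the closed form 3(1 + 1/n)^2 + 6(1 + 1/n) + 4 for this ratio, which starts at 28 for a single element and tends to 13 as n grows.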



Considering the computed asymptotic ratio, we built Table 2, which compares memory estimates to hold the stiffness matrix coefficients and operation counts to compute the matrix-vector products for hexahedral meshes, considering the element-by-element (EBE) and edge-based strategies. In this table we considered nel ≈ nnodes, nedges ≈ 13 × nel. 



Table 2. Memory to hold the stiffness matrix coefficients and computational costs for element and edge-based matrix-vector products for hexahedral finite element meshes. 

Data Structure   Memory          flop              i/a
EBE              300 × nnodes    1,152 × nnodes    72 × nnodes
Edges            117 × nnodes    336 × nnodes      234 × nnodes



Clearly, the data in Tables 1 and 2 show the superiority of the edge-based scheme over element-by-element strategies. However, compared to the EBE data structure, the edge scheme does not present a good balance between flop and i/a operations. Indirect addressing represents a major CPU overhead in vector, RISC and cache-based parallel machines. To improve this ratio, Lohner (1994) has proposed several alternatives to the single edge scheme. The underlying concept of such alternatives is that once data have been gathered, they should be reused as much as possible. This idea, combined with node renumbering strategies (Lohner, 1998), introduces further enhancements in the finite element edge-based scheme. We have found (Martins et al, 1997) that, for tetrahedral meshes, structures formed by gathering edges in spatial triangular and tetrahedral arrangements, the superedges, present a high data reutilization ratio and are simple to implement. The superedges are formed by reordering the edge list, gathering edges with common nodes to form tetrahedra and triangles. To make a distinction between elements and superedges, we call a triangular superedge a superedge3 and a tetrahedral superedge a superedge6. The matrix-vector product for a superedge3 may be expressed as 

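$$ \mathbf{K}\mathbf{p} = \sum_{s=1,4,7,\ldots}^{ned3} \left( \mathbf{K}_s \mathbf{p}_s + \mathbf{K}_{s+1} \mathbf{p}_{s+1} + \mathbf{K}_{s+2} \mathbf{p}_{s+2} \right) \tag{8} $$

and for a superedge6 

$$ \mathbf{K}\mathbf{p} = \sum_{s=1,7,13,\ldots}^{ned6} \left( \mathbf{K}_s \mathbf{p}_s + \mathbf{K}_{s+1} \mathbf{p}_{s+1} + \mathbf{K}_{s+2} \mathbf{p}_{s+2} + \mathbf{K}_{s+3} \mathbf{p}_{s+3} + \mathbf{K}_{s+4} \mathbf{p}_{s+4} + \mathbf{K}_{s+5} \mathbf{p}_{s+5} \right) \tag{9} $$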



where ned3 and ned6 are respectively the number of edges grouped as superedge3's and superedge6's. Table 3 gives the estimates for i/a reduction and flop increase for both types of superedges. We may see that we achieved a good reduction of i/a operations per edge, with a negligible increase of floating point operations. 



Table 3. Indirect addressing reduction and flop increase for the superedges. 

Type         Edges   Nodes   ia/edge   i/a reduction   flop/edge   flop increase
Edge         1       2       18:1      1.00            46:1        1.00
Superedge3   3       3       27:3      0.50            134:3       0.97
Superedge6   6       4       36:6      0.33            302:6       1.09



For hexahedral meshes we may gather the edges forming the superedges shown in Figure 2. 

Fig 2. Superedge arrangements for hexahedra. 



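The resulting matrix-vector products for each superedge type can be expressed as 

$$ \mathbf{K}\mathbf{p} = \sum_{s=1,7,13,\ldots}^{ned6} \left( \mathbf{K}_s \mathbf{p}_s + \mathbf{K}_{s+1} \mathbf{p}_{s+1} + \mathbf{K}_{s+2} \mathbf{p}_{s+2} + \mathbf{K}_{s+3} \mathbf{p}_{s+3} + \mathbf{K}_{s+4} \mathbf{p}_{s+4} + \mathbf{K}_{s+5} \mathbf{p}_{s+5} \right) \tag{10} $$

$$ \mathbf{K}\mathbf{p} = \sum_{s=1,9,17,\ldots}^{ned8} \left( \mathbf{K}_s \mathbf{p}_s + \mathbf{K}_{s+1} \mathbf{p}_{s+1} + \mathbf{K}_{s+2} \mathbf{p}_{s+2} + \cdots + \mathbf{K}_{s+6} \mathbf{p}_{s+6} + \mathbf{K}_{s+7} \mathbf{p}_{s+7} \right) \tag{11} $$

$$ \mathbf{K}\mathbf{p} = \sum_{s=1,17,33,\ldots}^{ned16} \left( \mathbf{K}_s \mathbf{p}_s + \mathbf{K}_{s+1} \mathbf{p}_{s+1} + \mathbf{K}_{s+2} \mathbf{p}_{s+2} + \cdots + \mathbf{K}_{s+14} \mathbf{p}_{s+14} + \mathbf{K}_{s+15} \mathbf{p}_{s+15} \right) \tag{12} $$

$$ \mathbf{K}\mathbf{p} = \sum_{s=1,29,57,\ldots}^{ned28} \left( \mathbf{K}_s \mathbf{p}_s + \mathbf{K}_{s+1} \mathbf{p}_{s+1} + \mathbf{K}_{s+2} \mathbf{p}_{s+2} + \cdots + \mathbf{K}_{s+26} \mathbf{p}_{s+26} + \mathbf{K}_{s+27} \mathbf{p}_{s+27} \right) \tag{13} $$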



where ned6, ned8, ned16 and ned28 are respectively the number of edges grouped as s6, s8, s16 and s28 types. Table 4 gives the estimates for i/a reduction and flop increase for these superedge types. We may observe that we also achieved a good i/a reduction with a negligible increase of floating point operations. However, coding complexity is increased, particularly for the s16 and s28 superedges. For a given finite element mesh we first reorder the nodes by the Reverse Cuthill-McKee algorithm to improve data locality. Then we extract the edges, forming as many superedges as possible. After that we color each set of edges by a greedy algorithm to allow parallelization on shared vector multiprocessors and scalable shared memory machines. We have observed that for general unstructured grids more than 50% of all edges can be grouped into superedges. 
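
The coloring step can be sketched as follows. This is one common greedy variant, given as an assumption rather than as the algorithm actually used here: each edge receives the smallest color not yet used by any edge touching one of its two nodes, so that edges of the same color never update the same nodal entry and can be processed concurrently. For superedges, a whole group would be treated as a single unit in the same way.

```python
def greedy_edge_coloring(edges):
    """Assign each edge the smallest color unused at both of its nodes."""
    used = {}      # node -> set of colors of edges already touching it
    colors = []
    for i, j in edges:
        taken = used.setdefault(i, set()) | used.setdefault(j, set())
        c = 0
        while c in taken:
            c += 1
        colors.append(c)
        used[i].add(c)
        used[j].add(c)
    return colors

# Small illustrative mesh: two triangles sharing the edge (2, 0).
edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 0)]
colors = greedy_edge_coloring(edges)
print(colors)  # [0, 1, 2, 0, 1]
```

Within one color, the scatter-add of the matrix-vector product is free of write conflicts, which is what allows vectorization or a parallel loop over the edges of that color.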




Parallel Edge-Based Finite Element Techniques for Nonlinear Solid Mechanics 



513 



Table 4. Indirect addressing reduction and flop increase for the superedges. 

Type   Edges   Nodes   ia/edge   i/a reduction   flop/edge   flop increase
Edge   1       2       18:1      1.00            46:1        1.00
s6     6       4       36:6      0.33            302:6       1.09
s8     8       8       72:8      0.50            410:8       1.11
s16    16      8       72:16     0.25            861:16      1.17
s28    28      8       72:28     0.14            1361:28     1.06
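
The indirect addressing columns of Tables 3 and 4 follow a simple pattern that can be checked numerically. The interpretation used below is our assumption: each node of a (super)edge contributes 3 degrees of freedom, and each degree of freedom is addressed indirectly three times (gathering the input vector, gathering the result and scattering it back), giving 9 accesses per node.

```python
# (edges, nodes) for each arrangement, taken from Tables 3 and 4.
SUPEREDGES = {
    "edge": (1, 2), "superedge3": (3, 3), "superedge6": (6, 4),
    "s8": (8, 8), "s16": (16, 8), "s28": (28, 8),
}
DOFS_PER_NODE = 3
ACCESSES_PER_DOF = 3   # assumption: gather p, gather y, scatter y

def ia_per_edge(n_edges, n_nodes):
    """Indirect addressing operations per edge of an arrangement."""
    return n_nodes * DOFS_PER_NODE * ACCESSES_PER_DOF / n_edges

base = ia_per_edge(*SUPEREDGES["edge"])   # single edge: 18 per edge
ia_reduction = {name: round(ia_per_edge(*shape) / base, 2)
                for name, shape in SUPEREDGES.items()}
print(ia_reduction)
```

Under this assumption the computed ratios coincide with the tabulated i/a reduction factors, which shows that the gain comes entirely from sharing gathered nodal data among more edges.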



4. Numerical Examples 

4.1 Performance Assessment of Edge-Based Matrix-Vector Product for 
Tetrahedral Meshes 

The performance of the single edge-based matrix-vector product algorithm and of the algorithms resulting from the decomposition of an unstructured grid composed of tetrahedra into superedges is shown in Tables 5 and 6, respectively for a Cray J90 superworkstation and for an SGI Origin 2000 with R10000 processors. In these experiments we employed randomly generated indirect addressing to map global to local, that is, edge quantities. Table 5 lists the CPU times for the matrix-vector products on the Cray J90 for an increasing number of edges, supposing that all edges in the mesh may be grouped as superedge3's or superedge6's. 



Table 5. CPU times in seconds for edge-based matrix-vector products on the Cray J90. 

Number of edges   Edges      Superedge3   Superedge6
3,840             1.92       1.89         1.87
38,400            2.81       2.42         2.23
384,000           11.96      8.12         5.82
3,840,000         102.4      59.64        42.02
38,400,000        1,005.19   579.01       399.06



We may observe that gathering the edges in superedges considerably reduces the CPU times, particularly in the superedge6 case. Another set of experiments was conducted on the SGI Origin 2000, a scalable shared memory multiprocessor. The average results of 5 runs, in non-dedicated mode, considering a total number of edges of 2,000,000 are shown in Table 6. We may observe that all data structures present good scalability, but the superedges are faster. For 32 CPUs the superedge6 matrix-vector product is almost 4 times faster than the product with single edges. 



Table 6. CPU times in seconds for edge-based matrix-vector products on the SGI Origin 2000. 

Processors   Edges   Superedge3   Superedge6
4            48.0    22.8         15.6
8            31.0    15.8         11.1
16           19.0    9.6          6.8
32           10.9    5.8          3.9
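
As a worked example, the scalability in Table 6 can be quantified relative to the 4-processor run, the smallest configuration reported. The snippet only post-processes the tabulated CPU times; the variable names are illustrative.

```python
# CPU times from Table 6: processors -> (edges, superedge3, superedge6).
cpu_time = {
    4: (48.0, 22.8, 15.6),
    8: (31.0, 15.8, 11.1),
    16: (19.0, 9.6, 6.8),
    32: (10.9, 5.8, 3.9),
}
names = ("edges", "superedge3", "superedge6")
base_p = 4

# Speedup with respect to the 4-processor run, and the matching efficiency.
speedup = {name: {p: cpu_time[base_p][k] / t[k] for p, t in cpu_time.items()}
           for k, name in enumerate(names)}
efficiency = {name: {p: s * base_p / p for p, s in row.items()}
              for name, row in speedup.items()}

for name in names:
    print(name, round(speedup[name][32], 2), round(efficiency[name][32], 2))
```

Going from 4 to 32 processors gives relative speedups between roughly 3.9 and 4.4 for the three data structures, that is, a relative efficiency of about one half.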



4.2 Performance Assessment of Edge-Based Matrix-Vector Product for Hexahedral Meshes 

The performance of the edge matrix-vector product algorithm and of the algorithms for the s6 to s28 decomposition of a hexahedral finite element mesh is shown in Tables 7 and 8, respectively for a Cray J90 superworkstation and for an SGI Origin 2000. These experiments were conducted under the same conditions as the previous experiment, that is, we employed randomly generated indirect addressing to map global to edge quantities. Table 7 lists for each superedge arrangement the CPU times for the matrix-vector products on the Cray J90, supposing that all edges in the mesh can be grouped as s6, s8, s16 or s28 superedges. 



Table 7. CPU times in seconds for edge-based matrix-vector products on the Cray J90se. 

Data Structure   nedges = 21,504   nedges = 215,400   nedges = 2,150,400
Edge             0.011             0.109              1.084
s6               0.011             0.108              1.080
s8               0.014             0.134              1.346
s16              0.016             0.148              1.483
s28              0.013             0.123              1.233



We may observe that only the s6 arrangement reduces the CPU time when compared to the performance of the single edge algorithm. This behavior may be attributed to the code complexity of the s8, s16 and s28 algorithms. We also made similar experiments on the SGI Origin 2000, a scalable shared memory multiprocessor. The results, for the same numbers of edges, are listed in Table 8. We may observe that although the s6 algorithm is still the fastest, all other superedge arrangements are faster than the single edge matrix-vector algorithm. This may be credited to the memory hierarchy of the SGI Origin 2000, where data locality plays a fundamental role in cache optimization. Parallel performance in this case is similar to the results obtained in the previous set of experiments. For nedges = 2,150,400 all superedge arrangements achieved speed-ups around 4 on 32 processors with respect to a 4-processor run. 



Table 8. CPU times in seconds for edge-based matrix-vector products on the SGI Origin 2000. 

Data Structure   nedges = 21,504   nedges = 215,400   nedges = 2,150,400
Edge             0.011             0.14               1.33
s6               0.006             0.12               0.75
s8               0.007             0.15               1.22
s16              0.011             0.14               1.01
s28              0.009             0.13               1.23



4.3 Extensional Behavior of a Sedimentary Basin 

We study the extensional behavior of a sedimentary basin presenting a sedimentary cover (4 km) over a basement (2 km) with length of 15 km and thickness of 6 km. The model has an ancient inclined fault with 500 m length and a slope of 60°. The relevant material properties are compatible with the sediment pre-rift sequence and basement. We have densities of 2450 kg/m^3 and 2800 kg/m^3 respectively for the sediment layer and basement; Young's modulus of 20 GPa for the sedimentary cover and 60 GPa for the basement; and Poisson's ratio of 0.3 for both rocks. The ratio between initial horizontal and vertical (gravitational) stresses is 0.429. We assume that both materials are under undrained conditions and modeled by the Mohr-Coulomb failure criterion. Thus, we have sedimentary cover cohesion of 30 MPa, basement cohesion of 60 MPa, and internal friction angle of 30° for both materials. The finite element mesh (see Figure 3) comprises 2,611,036 tetrahedra, 445,752 nodal points and 3,916,554 edges. The number of superedge6's is 57% of the total number of edges, while the number of superedge3's is just 6% of the total. We consider the model simply supported at its left and bottom faces, and we apply tension stresses at the right face and shear stresses at the basement, opposing the basin extension. The loads are applied in 12 increments, and the analysis is performed until the complete failure of the model. Memory requirements to solve this problem employing element and edge-based data structures are respectively 203.9 and 35.3 Mwords. We solve this problem on a 16-CPU Cray J90se using the ITS method and the edge-based strategy. Displacement 
and residual tolerances were set to 10'^. We selected PCG tolerances in the interval 
[10'^, 10"']. The parallel solution took only 15 minutes of elapsed time, corresponding 
to 36 nonlinear ITS iterations and 9,429 PCG iterations. Figure 4 shows the yield ratio
contours in the last load increment. The effects of the fault and the two material layers
may be clearly seen.




Fig. 3. Finite element mesh for the sedimentary basin (left) and a zoom at the fault (right). 




Fig. 4. Yield ratio contours for the sedimentary basin at the 12th load increment.



5. Conclusions 

We presented a fast, parallel, memory-inexpensive finite element solution scheme for
analyzing large-scale 3D nonlinear solid mechanics problems. Our scheme employs novel
nonlinear solution strategies and suitable data structures, allowing us to tackle
challenging problems in their full complexity. The novel data structures, based on
edges rather than elements, are applicable to meshes composed of tetrahedra and
hexahedra. By grouping edges into superedges we may further improve the computational
efficiency of the matrix-vector product, reducing the overhead associated with indirect
addressing, particularly on scalable shared memory multiprocessors.



Abstract. 3D simulation in electrical engineering is based on recent research
work (Whitney’s elements, auto-gauged formulations, discretization of the
source terms) and it results in complex and irregular codes. Generally, explicit
message passing is used to parallelize this kind of application, requiring
tedious and error-prone low-level coding of complex communication schedules
to deal with irregularity. In this paper, we focus on a high-level approach using
the data-parallel language High Performance Fortran. It allows both easier
maintenance and higher software productivity for electrical engineers. Though
HPF was initially conceived for regular applications, it can be successfully used
for irregular applications when using an unstructured communication library
that deals with indirect data accesses.



1 Introduction 

Electrical engineering consists of designing electrical devices like iron core coils
(example 1) or permanent magnet machines (example 2) (see Fig. 1). As prototypes
can be expensive, numerical simulation is a good way to reduce development costs.
It allows device performance to be predicted from physical design information.
Accurate simulations require 3D models, which demand large storage capacity and
CPU power. As computation times can be very long, parallel computers are well
suited for these models.

3D electromagnetic problem modeling is based on Maxwell’s equations. Generally,
solving these partial differential equations requires numerical methods. They
transform the differential equations into a system of algebraic equations whose
solution approximates the exact solution. The space discretization and the time
discretization can be done respectively by the finite element method (FEM) and by
the finite difference method (FDM). The finite elements used are Whitney’s elements
(nodes, edges, facets and volumes) [1]: they preserve the continuity properties of
the field at the discrete level. Depending on the problem studied, the equation
system can be linear or non-linear.

As for many other engineering applications, FEM codes use irregular data 
structures such as sparse matrices where data access is done via indirect address- 
ing. Therefore, finding data distributions that provide both high data locality 
and good load balancing is difficult. Generally, parallel versions of these codes 
use explicit message passing. 

In this paper, we focus on a data-parallel approach with High Performance
Fortran (HPF). Three reasons justify this choice. First, a high-level programming
language is more convenient than explicit message passing for electrical engineers.
It allows both easier maintenance and higher software productivity. Second,
libraries for optimizing unstructured communications in codes with indirect
addressing exist [4]. The basic idea is that these codes reuse the same communication
patterns several times. Therefore, it is possible to compute these patterns once and
to reuse them whenever possible. Third, programmers can further optimize their code
by mixing special manual data placements and simple HPF distributions to provide
both high data locality and good load balancing.

Section 2 presents the main features of our electromagnetic code. Section 3
presents the HPF parallelization of the magnetostatic part of this code. Section 4
presents the unstructured communication library we used to efficiently parallelize
the preconditioned conjugate gradient method. Section 5 presents the results we
obtained on an SGI Origin and an IBM SP2. Conclusions and future work are
given in Section 6.




Fig. 1. An iron core coil and a permanent magnet machine.



2 The Code

The L2EP at the University of Lille has developed a 3D FORTRAN 77 code 
for modeling magnetostatic (time-independent) and magnetodynamic (time-de- 
pendent) problems [5]. The magnetostatic part uses formulations in terms of 
scalar or vector potentials where the unknowns of the problem are respectively 
the nodes and the edges of the grid. The magnetodynamic part uses hybrid 
formulations that mix scalar and vector potentials. 

The main distinguishing feature of this code is the computation of the source
terms using the tree technique. Generally, in the electromagnetic domain, gauges
are required to obtain a unique solution. They can be added to the partial
differential equations, but this involves additional computations during the
discretization step. Another solution results from recent research work [7]. A
formulation is said to be compatible when all its entities have been discretized
onto Whitney’s elements. This work has shown that problems are auto-gauged when
both a compatible formulation and an iterative solver are used. Therefore, to
obtain a compatible formulation, source terms must be discretized using the tree
technique, which is in fact a graph algorithm.

As the meshing tool only produces nodes and elements, the edges and the
facets must be explicitly computed. The time discretization is done with Euler’s
implicit algorithm. The FEM discretization of a non-linear problem results in an
iterative loop in which linear equation systems are solved. Two non-linear methods
are used: Newton-Raphson’s method and the fixed-point method. Which one is used
depends on the formulation. The linear equation systems are either symmetric
positive definite or symmetric positive semi-definite. Therefore, the
preconditioned conjugate gradient method (PCG) is used. The preconditioner results
from an incomplete Crout factorization.

The overall structure of this code is as follows: 

1. define the media, the inductors, the magnets, the boundary 
conditions, etc. 

2. read the input file and compute the edges and the facets. 

3. compute the source terms. 

4. check the boundary conditions and number the unknowns. 

5. time loop: 

5.1. create the Compressed Sparse Row (CSR) representation of the
equation system.

5.2. non-linear loop: 

5.2.1. assembly loop: 

- compute and store the contribution of all the 
elements in the equation system. 

5.2.2. compute the preconditioning matrix. 

5.2.3. solving loop: 

- iterate preconditioned conjugate gradient. 



This structure shows that computation times can be very long when
we use time-dependent formulations in the non-linear case.

The matrices for the equation systems are represented in the Compressed
Sparse Row (CSR) format as shown in Fig. 2. As they are symmetric, the upper
triangular part is not stored explicitly, which saves memory.



    integer :: N = 8
    integer :: NZ = 14 ! non-zeros

    integer, dimension (N+1) :: iA
    integer, dimension (NZ) :: jA
    real (KIND=KIND(0.0d0)), dimension (NZ) :: A

[Figure: contents of the arrays iA, jA and A for an example matrix]

Fig. 2. Compressed sparse row format of a matrix.
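
To make the storage scheme concrete, the following Python sketch multiplies a symmetric matrix by a vector when only its lower triangle is stored in CSR form. It is an illustration under stated assumptions, not the code of the paper: indices are 0-based (the Fortran arrays are 1-based), the function name `csr_symmetric_matvec` is hypothetical, and the 3x3 matrix is made up rather than taken from Fig. 2.

```python
def csr_symmetric_matvec(iA, jA, A, x):
    """Return y = M x for a symmetric M whose lower triangle is stored in CSR (0-based)."""
    n = len(iA) - 1
    y = [0.0] * n
    for i in range(n):
        for k in range(iA[i], iA[i + 1]):
            j = jA[k]
            y[i] += A[k] * x[j]        # stored entry M(i, j) with j <= i
            if j != i:
                y[j] += A[k] * x[i]    # mirrored entry M(j, i) = M(i, j), never stored
    return y


# Hypothetical symmetric matrix (not the one of Fig. 2):
#   [ 4 1 0 ]
#   [ 1 3 2 ]
#   [ 0 2 5 ]
iA = [0, 1, 3, 5]                  # row i occupies positions iA[i] .. iA[i+1]-1
jA = [0, 0, 1, 1, 2]               # column index of each stored entry
A = [4.0, 1.0, 3.0, 2.0, 5.0]      # values of the lower triangle, row by row
y = csr_symmetric_matvec(iA, jA, A, [1.0, 2.0, 3.0])
print(y)
```

Each off-diagonal entry is used twice, once for its own row and once for the mirrored position, which is why the missing upper triangle costs no extra memory but does scatter updates into other rows of the result.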



3 HPF Parallelization of the Magnetostatic Part 

For software engineering reasons, the magnetostatic part of the code has been 
ported to Fortran 90 and it has been optimized. The Fortran 90 version makes 
extensive use of modules (one is devoted to the data structures), derived data 
types and array operations. It consists of about 9000 lines of code. This new 
version has been converted to HPF considering only the parallelization of the 
assembling and the solver that take about 80% of the whole execution time in 
the linear case. All other parts of the code remained serial. 

In a first step, all the data structures including the whole equation system 
have been replicated on all the processors. Each of them has to perform the 
same computations as the others until the CSR structure of the equation system 
is created. Some small code changes were necessary to reduce the number of 
broadcasts implied by the serial I/O operations. 

The assembling loop is a parallel loop over all the elements, each element
contributing some entries to the equation system (see Fig. 3). One unknown can
belong to more than one element, so two elements might add their contributions at
the same position; however, addition is assumed to be associative and commutative,
so the order in which the loop iterations are executed does not change the final
result (except for possible round-off errors). Therefore it is safe to use the
INDEPENDENT directive of HPF together with the REDUCTION clause for the arrays
containing the values of A (matrix) and B (right hand side). A one-dimensional
block-distributed template, whose size is given by the number of elements, has been
used to specify the work distribution of this loop via the ON HOME clause.



[Figure: the elements of the mesh contribute entries to A and B]

    !hpf$ template ELEMENTS (NEL)

    !hpf$ independent, reduction (A, B)
    !hpf$+ on home (ELEMENTS (IEL))
    do IEL=1,NEL

    end do

Fig. 3. Assembling the contributions for the equation system.
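
The effect of a reduction over a block-distributed element loop can be imitated sequentially. The Python sketch below is only an illustration of the semantics, not of how an HPF compiler implements the REDUCTION clause: each simulated processor accumulates into a private copy of B, and the copies are summed at the end. The function name `assemble` and the element data are hypothetical.

```python
def assemble(elements, n_unknowns, n_procs):
    """Accumulate element contributions into B using one private copy per processor."""
    nel = len(elements)
    block = (nel + n_procs - 1) // n_procs       # block distribution of the element loop
    private = [[0.0] * n_unknowns for _ in range(n_procs)]
    for p in range(n_procs):
        for iel in range(p * block, min((p + 1) * block, nel)):
            unknowns, values = elements[iel]
            for u, v in zip(unknowns, values):
                private[p][u] += v               # each processor writes only its own copy
    # Reduction step: add the private copies together.
    return [sum(private[p][u] for p in range(n_procs)) for u in range(n_unknowns)]


# Hypothetical elements: (unknowns touched, contribution to each unknown).
elements = [((0, 1), (1.0, 2.0)), ((1, 2), (3.0, 4.0)),
            ((2, 3), (5.0, 6.0)), ((0, 3), (7.0, 8.0))]
B1 = assemble(elements, 4, 1)
B2 = assemble(elements, 4, 2)
print(B1, B2)
```

The result does not depend on the number of simulated processors here because the sample values are exactly representable; with general floating-point data the sums may differ by round-off, as the text notes.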



In a second step, the equation system and the preconditioning matrix have
been general block distributed in such a way that all the information of one row
A(i,:) resides on the same processor as vector element z(i). Through the RESIDENT
directive, the HPF compiler gets the information that avoids unnecessary checks
and synchronizations for accesses to these matrices.

The algorithm for the preconditioned conjugate gradient method is dominated
by the matrix-vector multiplications and by the forward/backward substitutions
required during the preconditioning steps.

Due to the dependences in the computations, the incomplete Crout factorization
before the CG iterations and the forward/backward substitution in each iteration
cannot be parallelized as they stand. The factorization has been replaced with an
incomplete block Crout factorization that takes only the local blocks into account
and ignores the coupling elements. In this way, the factorization and the
forward/backward substitution do not require any communication. As the
corresponding preconditioning matrix becomes less accurate, the number of iterations
increases with the number of blocks (processors). But the overhead of the additional
iterations is less than the extra work and communication needed otherwise. HPF
provides LOCAL extrinsic routines where every processor sees only the local part
of the data. This concept is well suited to restricting the factorization and the
forward/backward substitution to the local blocks, where local dependences within
one block are respected and global dependences between the blocks are ignored.
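
The idea of keeping only the local blocks can be sketched in Python as a filter over the CSR entries. This is an assumption-laden illustration: the function `local_block_entries`, the 0-based indexing and the tiny matrix are hypothetical, and the sketch only selects the entries a block-local factorization would see; it does not perform the factorization itself.

```python
def local_block_entries(iA, jA, A, row_ranges):
    """Keep only the entries whose row and column belong to the same processor block."""
    kept = []
    for lo, hi in row_ranges:                # rows lo .. hi-1 are owned by one processor
        for i in range(lo, hi):
            for k in range(iA[i], iA[i + 1]):
                if lo <= jA[k] < hi:         # coupling entries to other blocks are dropped
                    kept.append((i, jA[k], A[k]))
    return kept


# Lower triangle of a hypothetical 3x3 matrix, split into two blocks of rows.
iA = [0, 1, 3, 5]
jA = [0, 0, 1, 1, 2]
A = [4.0, 1.0, 3.0, 2.0, 5.0]
kept = local_block_entries(iA, jA, A, [(0, 2), (2, 3)])
print(kept)
```

In this example the entry in row 2, column 1 couples the two blocks and is discarded, so each block can be factorized without communication.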

The matrix-vector multiplication uses halos [3] (see next section) that provide
an image of the non-local part of a vector on each processor, and the communication
schedule needed for the related updating operations. The computation time for
the halo and its communication schedule is amortized over the number of iterations.

For running our HPF code on parallel machines we used the Adaptor HPF
compilation system of GMD [2]. Only Adaptor supported the features needed
for the application (ON clause, general block distributions, RESIDENT directive,
REDUCTION directive, LOCAL routines, halos and reuse of communication schedules).
By means of a source-to-source transformation, Adaptor translates the data
parallel HPF program into an equivalent SPMD (single program, multiple data)
program in Fortran, which is then compiled with a native Fortran compiler. The
generated SPMD program contains many calls to the Adaptor-specific HPF runtime
system, called DALIB (distributed array library), which also implements the
functionality needed for halos (see Figure 4).



[Figure: Data Parallel Program (High Performance Fortran), fadapt, ADAPTOR System, DALIB]

Fig. 4. Schematic view of Adaptor HPF compilation.



4 Unstructured Communication Library Approach 

The Adaptor HPF compilation system provides a library that supports shadow
edges (ghost points), also called halos [4], for unstructured applications using
indirect addressing. A halo provides additional local memory in which a processor
keeps copies of the non-local values that it accesses, together with a
communication schedule that reflects the communication pattern between the
processors needed to update these non-local copies. As the size of the halo and the
communication schedule depend on the values of the indirection array, they can
only be computed at runtime.

The use of halos (overlapping, shadow points) is common manual practice
in message passing programs for the parallelization of unstructured scientific
applications. But the calculation of halo sizes and communication schedules is
tedious and error-prone and requires a lot of additional coding. Up to now,
commercial HPF compilers do not support halos. The idea of halos has already
been followed within the HPF+ project and implemented in the Vienna Fortran
Compiler [3], where the use of halos is supported by additional language features
instead of a library.

Fig. 5 shows the use of halos for the matrix-vector multiplication B = A × X.
The vectors X and B are block distributed among the available processors. The
matrix A is distributed in such a way that one row A(i,:) is owned by the same
processor that owns B(i). For the matrix-vector multiplication, every processor
needs all elements X(j) for the non-zero entries A(i,j) owned by it. Though
most of these values might be available locally, there remain some non-local
accesses. The halo will contain these non-local values after an update operation
using the corresponding halo schedule.



[Figure: vector B, matrix A and vector X distributed over the processors; legend: owned data, shadow data; indices 1 to 8]

Fig. 5. Distribution of matrix with halo nodes for the vector.
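
Which entries of X end up in a processor's halo follows directly from the column indices of the rows it owns. The Python sketch below illustrates this under stated assumptions: it uses 0-based indices, a plain block distribution, and a full (not triangular) sparsity pattern; the function names are hypothetical and do not correspond to the interface of the halo library, which the source does not show.

```python
def block_owner_ranges(n, n_procs):
    """Contiguous block distribution of the indices 0 .. n-1 over n_procs processors."""
    block = (n + n_procs - 1) // n_procs
    return [(p * block, min((p + 1) * block, n)) for p in range(n_procs)]


def halo_indices(iA, jA, ranges):
    """For each processor, the sorted non-local column indices referenced by its rows."""
    halos = []
    for lo, hi in ranges:
        needed = set()
        for i in range(lo, hi):
            for k in range(iA[i], iA[i + 1]):
                j = jA[k]
                if not (lo <= j < hi):   # X(j) is owned by another processor
                    needed.add(j)
        halos.append(sorted(needed))
    return halos


# Hypothetical 4x4 tridiagonal pattern, distributed over two processors.
iA = [0, 2, 5, 8, 10]
jA = [0, 1, 0, 1, 2, 1, 2, 3, 2, 3]
ranges = block_owner_ranges(4, 2)
halos = halo_indices(iA, jA, ranges)
print(ranges, halos)
```

Because the result depends only on `jA` and the distribution, it can be computed once and reused for every matrix-vector product as long as the sparsity pattern does not change.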



As mentioned before, the upper triangular part of A is not explicitly available.
For j > i the elements A(i,j) must be accessed via A(j,i). To avoid the
communication of matrix elements, the values of A(i,j) * X(j) are not computed
by the owner of B(i) but by the owner of X(j), as A(j,i) is local there. Therefore
we have an additional communication step to reduce the non-local contributions
to B. But for this unstructured reduction we can use the same halo structure for
the vector B.

For the calculation of the halo, we had to provide the halo array, which
is the indirectly addressed array, and the halo indexes that are used for the
indirect addressing in the distributed dimension. In our case, the halo arrays
are the vectors X and B, and the halo indexes are given by the integer array
jA containing the column indices used in the CSR format. Besides inserting
some subroutine calls to the HPF halo library, we only had to insert some
HPF directives for the parallelization of the loop implementing the matrix-vector
multiplication.

The calculation of the halo structure is rather expensive. But we can use the 
same halo structure for the update of the non-local copies of vector X and for 
the reduction of the non-local contributions of vector B. Furthermore, this halo 
can be reused for all iterations of the iteration loop in the solver as the structure 
of the matrix and therefore the halo indexes do not change. 



5 Results

Table 1 shows the characteristics of the problems of Fig. 1 in the case of a
vector potential formulation. The results below have been obtained for these test
cases on an SGI Origin and on an IBM SP2 in the linear case (in the
magnetodynamic case, each column would represent the results associated with one
time increment). Their quality has been checked in two ways: graphically with
the dumped files and numerically with the computed magnetic energies. All the
computed solutions are the same.



Table 1. Test cases of the code

                   example 1   example 2
nodes              8059        25730
edges              51356       169942
facets             84620       284669
elements           41322       140456
unknowns           49480       162939
non-zero entries   412255      1382596



Table 2 presents results for example 1 on the SGI Origin with up to four
processors. Both the assembling and the solver scale well for a small number of
processors.



Table 2. Results for example 1, SGI Origin

             NP = 1    NP = 2    NP = 3    NP = 4
assembling   12.87 s   6.54 s    4.63 s    3.61 s
solver       30.47 s   18.89 s   9.74 s    8.80 s
iterations   225       260       213       237



Table 3 presents results for example 1 on the IBM SP2 with up to 16 processors.
In the assembling loop the computational work for one element is so high
that the parallelization still gives good speed-ups for more processors, even if the
reduction overhead increases with the number of processors. The scalability of
the solver is limited as the data distribution has not yet been optimized for data
locality. On the other hand, the number of solver iterations does not increase
dramatically with the number of processors, despite the less accurate preconditioner.

A comparison of Table 2 and Table 3 shows that the numbers of iterations differ
between the two parallel machines for the same number of processors. This can be
explained by the different aggressive optimizations applied by the native Fortran
compilers (optimization level 3), which may generate (even slightly) different
results.



Table 3. Results for example 1, IBM SP2

             NP = 1    NP = 2    NP = 4    NP = 8    NP = 16
assembling   32.70 s   16.77 s   8.98 s    5.27 s    3.47 s
solver       48.45 s   36.99 s   23.51 s   14.45 s   10.59 s
iterations   224       247       257       247       258



Table 4 presents results for example 2 on the SGI Origin. The assembling and
one iteration of the solver scale well. Regarding the number of solver iterations,
the results are surprising. To explain them, we must look at the numbering of the
unknowns, which plays an important role in the conditioning of the equation
system. In our case, the meshing tool sorts its elements by volumes, every volume
corresponding to a magnetic permeability (air, iron, etc.). To avoid large jumps
of coefficients in the matrix, the unknowns are numbered by scanning the list of
elements. Therefore, every volume of the grid corresponds to a homogeneous block
of the matrix. The conditioning of this matrix is directly linked to the uniformity
of the magnetic permeability of these different blocks. When this uniformity is
poor, block preconditioners resulting from domain decomposition methods can be
used. By preconditioning each block independently, they bypass the problem. In our
case, the incomplete block Crout factorization has led to the same result. To
verify this, we re-sorted the elements of the grid by merging volumes with the same
magnetic permeability. As a result, the number of PCG iterations in the Fortran 90
program was divided by two.



Table 4. Results for example 2, SGI Origin

             NP = 1     NP = 2     NP = 3     NP = 4
assembling   43.77 s    22.71 s    16.03 s    12.95 s
solver       625.06 s   160.84 s   116.63 s   93.25 s
iterations   1174       464        509        505
time/iter    532 ms     347 ms     229 ms     165 ms
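
The time/iter row of Table 4 can be recomputed as the solver time divided by the number of iterations. The short Python check below does this with the values printed in the table; the variable names are ours.

```python
# Values copied from Table 4 (example 2, SGI Origin).
np_values = [1, 2, 3, 4]
solver_s = [625.06, 160.84, 116.63, 93.25]
iterations = [1174, 464, 509, 505]

# Time per solver iteration in milliseconds, rounded to the nearest integer.
ms_per_iter = [round(1000.0 * t / it) for t, it in zip(solver_s, iterations)]

# Speed-up of a single iteration relative to the one-processor run.
per_iter_speedup = [ms_per_iter[0] / m for m in ms_per_iter]

for n, m, s in zip(np_values, ms_per_iter, per_iter_speedup):
    print(f"NP = {n}: {m} ms per iteration, speed-up {s:.2f}")
```

The quotients for NP = 1, 2 and 3 reproduce the printed row. For NP = 4 the quotient is about 185 ms rather than the printed 165 ms, which may point to a transcription error in that entry; the printed table is left unchanged here.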



6 Conclusions and Future Work 

This HPF version achieves acceptable speed-ups for small numbers of processors.
Given the small number of HPF directives added to the Fortran 90 code, it is a very
cheap solution that allows both easier maintenance and higher software
productivity for electrical engineers, compared to an explicit message passing
version. This was its main objective. Results obtained for real electrical
engineering problems show that HPF can deal efficiently with irregular codes when
using an irregular communication library.

In order to improve data locality and to reduce memory consumption, the
next HPF version will use a conjugate gradient method where the preconditioner
will be based on domain decomposition using a Schur complement method [6].
In the final step of this work we will add the magnetodynamic formulations.



Abstract. The computational requirements of simulating cloth and other
non-rigid solids are high, and the run time is often far from real time. In
this paper, we present an efficient parallel solution of the problem, which is the
result of a comprehensive analysis of the data distributions and the parallel
behavior of various iterative system solvers and preconditioners. Our parallel
code combines data parallelism with task parallelism, achieving good load
balancing and minimizing the communication cost. The execution time obtained
for a typical problem size, its superlinear speed-up, and the isoscalability shown
by the model will make it possible to reach real-time simulations in scenes of
growing complexity, using the most powerful parallel computers.



1 Introduction 



Cloth and flexible material simulation is an essential topic in computer animation
of realistic virtual humans and dynamic scenes. New emerging technologies, such as
interactive digital television and multimedia products, require the development of
powerful tools able to perform real-time simulations. There are many approaches
to simulating flexible materials. Geometrical models are usually considered the
fastest, but they require a high degree of user intervention, making them unusable
for interactive applications. In this paper, a physically-based model, which
provides a reliable representation of the behavior of the materials (e.g. garments,
plants), has been chosen. In a physical approach, clothes and other non-rigid
objects are usually represented by interacting discrete components (finite
elements, springs-masses, patches), each one numerically modeled by an ordinary
differential equation (1), where x is the vector of positions of the masses M. The
derivatives of x are the velocities $\dot{x} = v$ and the accelerations $\ddot{x}$.



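$$\ddot{x} = M^{-1} f(x, \dot{x}), \qquad \frac{d}{dt}\begin{pmatrix} x \\ v \end{pmatrix} = \begin{pmatrix} v \\ M^{-1} f(x, v) \end{pmatrix} \qquad (1)$$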



Also, in most energy-, force- or constraint-based formulations, the equations
contain non-linear components that are typically linearized by means of a first
order Taylor expansion. Explicit integration methods, such as forward Euler and
Runge-Kutta, result in easily programmable code and accurate simulations [3], and
have been broadly used during the last decade, but recent work [1] demonstrates
that implicit methods outperform explicit ones, at the cost of a loss of precision
that is not visually perceptible. In the composition of virtual scenes, appearance
rather than accuracy is required, so, in this work, an implicit technique, the
backward Euler method, has been used. In section 2, a description of the models
used and the implementation technique for the implicit integrator is presented.
Also, the solution of the resulting system of algebraic equations by means of the
Conjugate Gradient method is analyzed. In section 3, the parallel algorithm and the
data distribution technique for a scene are shown. Finally, in section 4, we
present some results and conclusions.

2 Implementation 

To create animations, a time-stepping algorithm has been implemented. Every 
step is mainly performed in three phases: computation of forces, determination 
of interactions and resolution of the system. The iterative algorithm is shown 
below: 



    do {
        computeForces();
        collisionDetection();
        solveSystem();
        updateSystem();
        time = time + timeStep;
    } while (time < FinalTime);

The updateSystem procedure computes the new state, made up of the position 
and velocity of each element, calculated from the previous one. computeForces, 
collisionDetection and solveSystem stages are described below. 



2.1 Forces 



Forces and constraints are evaluated on every discrete element in order to
compute the equation coefficients for Newton’s second law. Our model considers
both spring-mass discretization of 2D-3D objects, and triangle patches for the
special case of 2D objects like garments. The particular forces considered are:
visco-spring forces mapped in the grid for the former model; and stretch, shear
and bend forces for the latter. In both cases, gravity and air drag forces have
also been included. The backward Euler method approximates Newton’s second law
by equation (2) in the k-th time step,

$$\Delta v = \Delta t \, M^{-1} f(x_{k+1}, v_{k+1}) = \Delta t \, M^{-1} f(x_k + \Delta x, v_k + \Delta v) \qquad (2)$$



This is a non-linear system of equations which has been time-linearized by a first
order Taylor expansion as follows:

$$f_{k+1} = f(x_{k+1}, v_{k+1}) = f_k + \left.\frac{\partial f}{\partial x}\right|_k \Delta x + \left.\frac{\partial f}{\partial v}\right|_k \Delta v \qquad (3)$$

where $\Delta x = \Delta t \, (v_k + \Delta v)$. An energy function $E_\alpha$ is
analytically described for every discrete element $\alpha$; the forces acting on
particle $i$ are derived from $f_i = -\partial E_\alpha / \partial x_i$, and the
coefficients arising from the analytical partial derivatives in equation (3) have
been coded for numerical evaluation. All of the above gives a large system of linear
algebraic equations with a sparse coefficient matrix. The sparsity pattern of the
system matrix, for internal forces, is known exactly, because the non-zero
components of a row correspond to the neighbours affecting a given particle for a
given tessellation of the object. A Compressed Row Storage (CRS) of the matrix is
used in order to minimize memory usage. The process of building the system includes
loops in which the elements of the matrix are updated following an irregular
pattern. Such a pattern is termed a histogram reduction.
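
Substituting the linearization (3) and $\Delta x = \Delta t \, (v_k + \Delta v)$ into (2), multiplying by $M$ and collecting the terms in $\Delta v$ gives the linear system that is handed to the solver. The source does not print this final form, so the arrangement below is a sketch that follows the usual treatment of backward Euler for cloth [1] and should be read as our reconstruction.

```latex
\begin{aligned}
M \, \Delta v
  &= \Delta t \left( f_k
     + \left.\frac{\partial f}{\partial x}\right|_k \Delta t \, (v_k + \Delta v)
     + \left.\frac{\partial f}{\partial v}\right|_k \Delta v \right) \\[4pt]
\left( M
     - \Delta t \left.\frac{\partial f}{\partial v}\right|_k
     - \Delta t^2 \left.\frac{\partial f}{\partial x}\right|_k \right) \Delta v
  &= \Delta t \left( f_k
     + \Delta t \left.\frac{\partial f}{\partial x}\right|_k v_k \right)
\end{aligned}
```

The matrix on the left-hand side inherits the sparsity pattern of the force Jacobians, which is why its structure is fixed by the tessellation of the object.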

2.2 Solver 

In the solveSystem procedure, the unknowns Δv are computed. As stated above,
implicit integration methods require the solution of a large, sparse linear system
of equations that must be simultaneously fulfilled. An iterative solver, the
preconditioned conjugate gradient method, has proven to work well in practice.
This method requires relatively few, and reasonably cheap, iterations to converge
[1]. The choice of a good preconditioner can result in a significant reduction of
the computational cost in this stage. In this work, five preconditioners have been
studied. We have tested every preconditioner for a wide set of scenes. In every
case, Block-Jacobi has proven to be the fastest, although the number of iterations
required for a given tolerance is larger than with the incomplete factorization
techniques.
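
For readers unfamiliar with the method, the following Python sketch shows a preconditioned conjugate gradient iteration with a diagonal (Jacobi) preconditioner. It is a simplified stand-in chosen for brevity: the paper selects Block-Jacobi with 3 x 3 blocks, which would replace the division by the diagonal with the solution of small block systems. The function name, the stopping rule on the relative residual and the test matrix are our assumptions.

```python
def pcg_jacobi(matvec, diag, b, tol=1e-10, max_iter=1000):
    """Preconditioned conjugate gradient with a diagonal (Jacobi) preconditioner."""
    n = len(b)
    x = [0.0] * n
    r = list(b)                                   # residual for the zero initial guess
    z = [r[i] / diag[i] for i in range(n)]        # apply the preconditioner
    p = list(z)
    rz = sum(r[i] * z[i] for i in range(n))
    b_norm = sum(v * v for v in b) ** 0.5
    for it in range(1, max_iter + 1):
        Ap = matvec(p)
        alpha = rz / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        if sum(v * v for v in r) ** 0.5 <= tol * b_norm:
            return x, it                          # converged
        z = [r[i] / diag[i] for i in range(n)]
        rz_new = sum(r[i] * z[i] for i in range(n))
        p = [z[i] + (rz_new / rz) * p[i] for i in range(n)]
        rz = rz_new
    return x, max_iter


# Hypothetical 3x3 symmetric positive definite system with solution (1, 2, 3).
M = [[4.0, 1.0, 0.0], [1.0, 3.0, 2.0], [0.0, 2.0, 5.0]]
matvec = lambda v: [sum(M[i][j] * v[j] for j in range(3)) for i in range(3)]
b = [6.0, 13.0, 19.0]
x, iters = pcg_jacobi(matvec, [4.0, 3.0, 5.0], b)
print(x, iters)
```

Each iteration costs one matrix-vector product, one preconditioner application and a few inner products, which is the cost profile that makes the trade-off between cheap and strong preconditioners in Table 1 meaningful.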



Table 1. Accumulated number of PCG iterations and the elapsed time in seconds
of 100 steps of the simulation loop.

Preconditioner           Number of iterations   Elapsed time (s)
Jacobi                   15022                  28.118
Block-Jacobi             10001                  21.632
(L + D) D^{-1} (D + U)   5298                   22.240
Incomplete-LU            4326                   24.651
I-Cholesky               4319                   25.483



In table 1, we present the accumulated number of inner PCG iterations for each
preconditioner in an example simulation of one hundred steps of 10 ms, for a total
of one simulated second. The average number of iterations per step ranges from 150
with the Jacobi preconditioner to 43 with the incomplete factorizations. The
elapsed execution time covers these 100 time steps. In each step the system matrix
and the preconditioner have been built, and the system of equations has been
solved. These results have been obtained on an R10000 processor at 250 MHz.

We have chosen the Block-Jacobi preconditioner for our algorithm, not only
for its execution time but also for its better parallel behavior [2]. Due to the
three-dimensional nature of the problem, blocks have been formed by grouping the
physical variables, so the block dimension is 3 x 3.



2.3 Collisions 

To detect possible interactions and forbidden situations, like colliding surfaces
or body penetrations, we have used a hierarchical approach based on bounding
boxes. In these cases, we introduce additional forces, by means of a penalty
function, to keep the system in a legal state. The additional coefficients
introduced in the system matrix are stored in an additional data structure. These
forces are included as described in the previous section. In the case of cloth
simulation, the self-collision detection and the human-cloth collisions may be
critical and the computational cost can be extremely high [4]. To improve the
performance of the collision detection algorithm, the hierarchical method has been
modified in order to exploit temporal coherence, using additional lists of recent
collisions.



3 Parallelization 

The parallelization of the model has been performed on an SGI Origin 2000, a
NUMA multiprocessor architecture, using the SGI-specific directives. Current
automatic parallelizing tools are not able to extract enough parallelism from
this kind of irregular reduction. We have used a cache-coherent shared memory
programming model and a data parallelism strategy. Task parallelism has also been
considered for the collision detection stage. The distribution of the objects
in a scene among the processors is performed using a proportional rule based
on the number of elements, such as particles, triangles, and forces. The
redistribution and reordering of the elements inside an object among the assigned
processors have been performed using domain decomposition methods.




Fig. 1. Coefficient matrices for original, MRD, and stripped ordering.

These strategies, the reordering of the particles and the redistribution of
the iterations in the loops which fill the system matrix, are very efficient. They
only need to be done once, at the beginning of the simulation process, and the
resulting data have a high locality during the simulation. To avoid an undesirable
excess of mutually exclusive writes while building the system, those
particles that must be updated by more than one processor are replicated on
each of them. The final system is obtained by accumulating the contributions from
the processors.

In figure 1, the non-zero coefficients of the system matrix for three different
reorderings of the particles are shown. It can be observed that the stripped
ordering results in a thin banded diagonal, which gives the parallel distribution
with the lowest communication cost. The Multiple Recursive Distribution (MRD)
has more locality, which results in better cache usage. The choice of
method will depend on the scene and the computational platform.



3.1 Solver 

The PCG algorithm has been parallelized following a well-known strategy in
which the successive parts of the vectors and the properly aligned rows of
the matrices are distributed among the processors. Using this scheme and the
Chronopoulos and Gear variant of PCG [2], very few messages and synchronization
points are required during the iterative process. The operations have been
parallelized by using the data distribution previously
computed for the force calculation stage. Computations involving data owned
by a processor have been performed using sequential BLAS libraries, which are
specially optimized for the underlying hardware. The use of the Block-Jacobi
preconditioner prevents the loss of efficiency due to the remote memory accesses
that appear with other preconditioners, such as those based on incomplete
factorizations, in the product between the preconditioning matrix and the vector.
Operations involving the preconditioner are local.
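
Why the Block-Jacobi preconditioner needs no remote data can be seen from a minimal sketch of its application, given below. It assumes one 3 x 3 diagonal block per particle stored as a list of rows; the function names and the use of Cramer's rule are illustrative choices of ours, not the implementation used in this work.

```python
def solve3(A, b):
    """Solve a 3 x 3 system A z = b by Cramer's rule (A is a list of rows)."""
    def det(M):
        return (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
                - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
                + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))

    d = det(A)
    z = []
    for col in range(3):
        # Replace one column of A by b and take the determinant ratio.
        M = [[b[r] if c == col else A[r][c] for c in range(3)] for r in range(3)]
        z.append(det(M) / d)
    return z


def apply_block_jacobi(diag_blocks, r):
    """Return z with M z = r, where M is block diagonal with 3 x 3 blocks.

    diag_blocks[i] is the 3 x 3 diagonal block of particle i; r is the flat
    residual vector of length 3 * len(diag_blocks).
    """
    z = []
    for i, block in enumerate(diag_blocks):
        z.extend(solve3(block, r[3 * i:3 * i + 3]))
    return z
```

Each block is solved with the three residual entries of its own particle only, so a processor that owns a set of particles can apply the preconditioner without reading data owned by any other processor.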

In this paper, an innovative strategy for the parallel implementation of the
PCG algorithm for cloth simulation is proposed. In our scheme, global
synchronization points (GSPs) and message exchanges (MEs) among the processors in
the iterative process can be handled in three different ways. First, GSPs and
MEs can be eliminated when they are not required (for example, when two sets
of processors are solving the corresponding subsystems for separate pieces of
clothing). In this case, the two subsystems become independent, and the inherent
parallelism of the different elements in a scene is exploited. Second, GSPs and
MEs are used in the usual way. This will be the general case for the
parallelization inside a garment. A third case has been tested, keeping every GSP
in the algorithm, but leaving the messages to be sent without any local
synchronization between neighbouring processors. Although the number of iterations
usually increases, the simulation time has been reduced by about 10%. A heuristic
to recover the system if any of the mentioned strategies fails has also been
considered.

Fig. 2. Several frames from a simulation of a tablecloth under windy conditions



4 Results and Conclusions 

Figure 2 shows some frames of a simulation of a tablecloth. In figure 3, the
execution time, in seconds (excluding the collision detection phase), of one second
of simulated time and the speed-up of three example models are shown. These
results have been obtained using an SGI Origin2000 with 16 R10000 250 MHz
processors. A real-time simulation is observed for the first model using six
processors. The speed-up of the third and most complex model shows superlinear
behavior with up to four processors, mainly due to better exploitation of the
memory hierarchy thanks to the improved data locality. These and other preliminary
results have been used to draw several conclusions.

The use of more recent computers and a higher number of processors for
models with a higher number of elements will allow real-time simulations, taking
into account the isoscalability shown in our preliminary results. The scene
complexity, considering interaction between several objects, can be increased as
the speed of the microprocessors increases.

Fig. 3. Time and speed-up of simulations with 599, 2602, and 3520 particles, respectively.

Abstract. Simulation of the dynamic behaviour of liquid-liquid systems is of
prominent importance in many industrial fields. Algorithms for fast and reliable
simulation of single stirred vessels and extraction columns have already been
published by some of the present authors. In this work, we propose a
methodology to develop a parallel version of a previously validated sequential
algorithm, for the simulation of a liquid-liquid Kühni column. We also discuss
the algorithm implementation in a distributed memory parallel-computing
environment, using MPI. Despite the difficulties encountered in preserving
efficiency in the case of a heterogeneous cluster, the results demonstrate
performance improvements that clearly indicate that the approach followed may
be successfully extended to allow real-time plant control applications.



Key words: Distributed Memory Parallel Systems; MPI; Simulation of Liquid-Liquid 
Systems. 



1. Introduction

The mass transfer efficiency of liquid-liquid agitated systems is highly dependent on 
the hydrodynamics of the dispersed phase, namely of the drop break-up and 
coalescence frequencies that result from the turbulence induced by agitation. In 
reacting systems, this behaviour is also of fundamental importance to the overall rate 
and selectivity of the process. A comprehensive and synthetic discussion about the 
behaviour of liquid-liquid systems is found in Ramkrishna’s work [1]. 

Knowledge of the dynamic behaviour of liquid-liquid systems is still limited, in 
particular when it comes to its implementation as physically accurate, fast and reliable 
algorithms, with effective predictive power and suitable for real-time plant control 
applications [2]. Potential fields of practical use of this knowledge base encompass 
very broad segments of chemical technology, including the recovery of important 
non-renewable resources or the removal of dangerous substances. 

Ribeiro L. M. [3] and Ribeiro L. M. et al. [4] published innovative algorithms for 
directly (numerically) solving the population balance equation for the simulation of 
the full trivariate (drop volume, v, solute concentration, c, and age, τ) unsteady-state
behaviour of interacting liquid-liquid dispersions, in single continuous (or batch) 

J. M.L.M. Palma et al. (Eds.): VECPAR 2000, LNCS 1981, pp. 536-547, 2001. 

© Springer-Verlag Berlin Heidelberg 2001 



Simulation of the Dynamic Behaviour of Liquid-Liquid Agitated Columns 






stirred vessels. Not only was the start-up period towards the steady state simulated, but
also the system’s response to disturbances in the main operating variables (mean
residence time, dispersed phase hold-up, agitation power input density, feed drop
volume distribution and dispersed and continuous phase solute concentrations). The
methodology used was later applied to a simplified version of the algorithm that
calculates the drop size distribution and the mean and standard deviation of solute
concentration within each volume class [5]. This methodology was further extended
to simulate the behaviour of a liquid-liquid extraction column [6].

The aim of this paper is to show that, using low-cost high-performance computing
environments and the methodology referred to above, it is possible to simulate in detail
the dynamics of stirred liquid-liquid extraction columns, with execution times suitable
for predicting the behaviour of these systems and for control purposes.



2. The Sequential Algorithm 

Following the experimental work carried out by Gomes [7] in a Kühni pilot plant
column of the Technical University of Munich, a sequential algorithm was developed
to trace its dynamics [6]. This column has an internal diameter of 150 mm and
36 stages, each 70 mm high.

A Kühni column may be adequately described as a sequence of agitated vessels
with back mixing and forward mixing effects on the movement of the dispersed phase
along the column. The hydrodynamic phenomena of break-up and coalescence of the
individual drops of the dispersed phase were modelled using the population balance
formulation of Coulaloglou and Tavlarides [8].

Besides the interaction phenomena, the transport of the drops from one stage to the
next must also be modelled. The transport model used was based on the one described
by Cruz-Pinto [9], taking into account the constriction factor calculated by Goldman
[10] and the dispersion equation developed by Regueiras [11]. The mathematical
model equations used are presented elsewhere [11].

From the mathematical model, the drop birth and death rates due to break-up,
coalescence and drop movement along the column are calculated. Representing by
B(n,t) and D(n,t) these source and sink terms, at time t and location [n, n+dn] of
the drop phase space, the dynamics of the drop number density function X(n,t) is
described by:

\[
\frac{\partial X(n,t)}{\partial t} + \frac{\partial}{\partial n}\left[ \frac{dn}{dt}\, X(n,t) \right] = B(n,t) - D(n,t) \qquad (1)
\]



To numerically solve the above population balance equation, a phase space-time 
discretization is used and drops are assumed to reside on cell sites. Drops move from 
cell to cell in the discretized phase-space at each time step. The numerical integration 
scheme involves the explicit calculation of time derivatives, with a first-order 
backward finite-difference method [4]. 
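
As a rough illustration of such an explicit first-order update, one time step for the cell populations could look like the sketch below. It assumes that the birth and death terms of each cell, including those due to drop movement, have already been evaluated at time t; the function and argument names are ours.

```python
def euler_step(x, birth, death, dt):
    """Advance the cell populations x by one explicit first-order step.

    x[i]     -- number density of drops in cell i at time t
    birth[i] -- source term B for cell i, evaluated at time t
    death[i] -- sink term D for cell i, evaluated at time t
    """
    return [xi + dt * (b - d) for xi, b, d in zip(x, birth, death)]
```

Because the right-hand side uses only values from time t, the update of each cell is independent of the updates of the others within a step.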

The sequential algorithm developed for the counter-current Kühni column
simulation is able to predict the local drop size distributions and the local hold-up
profiles of the column. The algorithm was implemented in C++ and the corresponding
program is presently available for Windows 9x and Windows NT environments [6].

The program consists of two parts: the initialization of the system and the column 
simulation. The corresponding flowchart is presented in Fig. 1. 

The main program starts by reading all the data needed to perform the simulation.
These data include the physical characteristics of the column, such as the number
of stages, the stirrer diameter, and the height and diameter of each stage; the drop
breakage, coalescence and transport model parameters; the physical properties of
both phases, such as density, viscosity and interfacial tension; the operating
conditions of the column, namely the flow rates of each phase and the stirrer
rotational speed; the total simulated time, tmax; and the time interval, Δt, at which
the program writes to a file the values of the column and system state variables.

At time t=0, the column variables are initialized to a standard initial state, 
corresponding to a column filled with continuous phase and no dispersed phase. 

The program then enters a loop where it writes the values of the column
variables to a file, tests whether the time has reached the total simulated time and,
if not, calls the TimeStep routine to calculate the column status at time t+Δt. Then, it
updates the value of t and returns to the beginning of the loop. When tmax is reached,
the program exits the loop, writes global results to a file and terminates execution.

The routine TimeStep executes the simulation of the column for a period of time,
Δt, between two consecutive WriteData calls. In order to accomplish this
objective, the routine calls the dXdt routine for each column stage and, based on the
death frequencies obtained, calculates a suitable step value for the integration. This
value, dt, is then used to calculate the new values of the variables describing the state
of the column. When the accumulated time reaches Δt, this routine is exited,
returning control to the main program loop.

The routine dXdt calculates the drop birth and death frequencies inside a single
column stage, as well as the number of drops per unit time exchanged with the
contiguous stage. It also calculates the continuous phase flow rate between the same
two stages. To perform these calculations, this routine needs the values of the state
variables at both stages. Only the auxiliary variables of the current stage are modified.
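
The interplay between TimeStep and dXdt can be sketched as below. The text only states that the step dt is derived from the death frequencies; the specific rule used here (dt limited to a fraction `safety` of the inverse of the largest death frequency) is an assumption of ours, as are the Python names and the signature of the dxdt callable.

```python
def time_step(state, dxdt, delta_t, safety=0.5):
    """Integrate `state` over an interval delta_t with internally chosen steps.

    state   -- list of stages, each a list of drop numbers per volume class
    dxdt    -- callable(state, k) -> (births, deaths, death_freq) for stage k
    safety  -- hypothetical limit on the fraction of a class removed per step
    """
    elapsed = 0.0
    while elapsed < delta_t:
        rates = [dxdt(state, k) for k in range(len(state))]
        # Largest death frequency (1/s) over all stages and classes.
        fmax = max((f for _, _, freq in rates for f in freq), default=0.0)
        dt = delta_t - elapsed
        if fmax > 0.0:
            dt = min(dt, safety / fmax)
        # Explicit update of every stage from the rates evaluated at time t.
        state = [[x + dt * (b - d) for x, b, d in zip(stage, births, deaths)]
                 for stage, (births, deaths, _) in zip(state, rates)]
        elapsed += dt
    return state
```

All stages are evaluated with the state of the previous internal step before any of them is updated, which is the property that later allows the dXdt calls to be divided among processes.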

The hierarchy of the called routines and the routine tasks are outlined in Fig. 2 and 
Table 1, respectively. 

The routine LLExtrColumns corresponds to the ‘Initialization of the Column’
box and to the ‘Initialization of the variables’ box. The TimeStep and dXdt routines
are designated on the flowchart by their own names.

We have already shown that the results obtained with the sequential program for
the hold-ups and the drop size distributions at different stages of the column are in
good agreement with the experimental data, for several operating conditions of the
column [7].

Fig. 1. Sequential algorithm and TimeStep routine

Fig. 2. The hierarchy of the called routines



Routine            Task

ClearDerivatives   Prepares the variables for the calculations in dXdt.
dXdt               Calculates the drop birth and death frequencies of one stage.
LLColMain          Main part of the program; calls the routines.
LLExtrColumns      Prepares each stage for the beginning of the simulation and
                   calculates the inlet drop distributions.
TimeStep           Executes the simulation for a given period of time.
WriteData          Outputs to a file the results at the end of each time-step.
WriteFinalData     Outputs to a file the final results.


Table 1. Routine tasks 



So far, the program does not include mass transfer calculations. With mass transfer,
it is generally necessary to solve the population balance equation (1) in a
three-dimensional phase-space. In the present case, using a monovariate drop property
(volume) distribution, the execution time achieved with a 120 MHz Pentium for one
second of simulation time was four times longer than the real process, with a drop
volume discretization of 20 classes. Although already fast in comparison to other
resolution approaches [2], this algorithm needs to be further accelerated in order to be
suitable for future control applications to liquid-liquid extraction columns under mass
transfer conditions. The introduction of excessive algorithm simplifications, other
than those of the underlying mathematical model, is not desirable, as they would
hide most of the information on the temporal behaviour of the distribution of the
dispersed phase properties. This need to speed up the calculations was the motivation for
the development of a parallel version of the sequential program. This parallel version,
implemented for a distributed memory parallel computing environment, is currently
the only published promising approach to the future realistic simulation of various
contactors, including extraction columns, and their control.

3. The Parallelization Approach



3.1 Initial Considerations 

A sequential C code was written for the algorithm to ensure that the calculations in
each time step only need the results from the previous iteration.

The analysis of the logical units of this sequential code pointed out the
methodology used to develop a parallel version of the algorithm. Table 2 clearly
shows that the most time-consuming routine is the one responsible for calculating the
drop birth and death frequencies (due to drop breakage, coalescence, and transport) in
each time step and for each column stage (dXdt routine). The time taken by the
execution of the other routines is relatively insignificant and is not shown in Table 2.
The parallel version of the algorithm is thus based on the partition of the calculation
of these frequencies, for each time step, among the several processors available. The
synchronization is made at the end of each iteration.



Name        Time (%)    Secs    Calls    Self (ms/call)    Total (ms/call)

dXdt          87.40     2.29     5040         0.45               0.49
TimeStep       4.20     0.11       20         5.50             131.00


Table 2. The most time consuming routines 
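
A profile of this kind also gives a rough ceiling on what parallelizing dXdt alone can achieve. The sketch below applies Amdahl's law to the 87.40 % share of Table 2; it assumes that this share is divided perfectly among the processors and that everything else stays serial, which ignores communication and is our simplification rather than a result of this work.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Upper bound on speedup when only `parallel_fraction` of the run time
    is divided evenly among `processors` (Amdahl's law)."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / processors)


# 87.40 % of the profiled run time is spent in dXdt (Table 2).
for p in (2, 4, 6, 18):
    print(p, round(amdahl_speedup(0.874, p), 2))
```

For this particular profile the bound tends to 1/(1 - 0.874), roughly 7.9, as the number of processors grows. The share of dXdt, and therefore the bound, depends on the problem size that was profiled.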



3.2 The MPI Implementation 

The parallel program was implemented in C for a distributed memory
parallel-computing environment using MPI (MPICH 1.1.2).

The flowchart below shows that all of the processes call the TimeStep routine. In
this routine, the master sends a sequence of stages to each one of the other processes,
keeping the first group for itself. Each process also receives the last stage of the
previous process, since this information is needed for the calculations. All the
processes, including the master, contribute to the calculation, calling the dXdt
routine. The master receives all the results sent by the other processes at the end of
each time step and performs the control calculations, such as the overall hold-up and
the verification of a possible column flooding situation.
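
The distribution of stages described above can be illustrated with a small partitioning routine. It is a sketch only: the text does not say how stages are assigned when the number of processes does not divide the number of stages, so giving the extra stages to the lowest ranks is our assumption, and the function name is hypothetical.

```python
def partition_stages(n_stages, n_procs):
    """Split stages 0..n_stages-1 into contiguous blocks, one per process.

    Returns a list of (first, last_exclusive, halo) tuples; halo is the index
    of the last stage of the previous process, or None for the master (rank 0).
    """
    base, extra = divmod(n_stages, n_procs)
    blocks, first = [], 0
    for rank in range(n_procs):
        # The first `extra` ranks take one additional stage each (assumption).
        size = base + (1 if rank < extra else 0)
        halo = first - 1 if rank > 0 else None
        blocks.append((first, first + size, halo))
        first += size
    return blocks
```

For the 36-stage column and four processes this gives four blocks of nine stages, with processes 1 to 3 additionally receiving stages 8, 17 and 26, respectively, from their predecessors.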

In order to minimize the overhead due to information exchange, presently about
4 KB for each stage (13.3 KB when mass transfer is included), all the information
was sent in a single message (MPI_Isend), taking advantage of the count and derived
datatype parameters of MPI.

The program was first tested both on a heterogeneous cluster and on a
homogeneous one. On the heterogeneous cluster, from the Engineering Faculty of the
University of Porto, five Alpha processors were used, with different clock rates: 150
MHz (2 nodes), 175 MHz (2 nodes) and 266 MHz (1 node). A 100 Mbps FDDI
crossbar switch (Digital Equipment Corporation/Compaq GIGAswitch) connects
these nodes. The operating system is Tru64 Unix v4.0E. On the homogeneous
cluster, from the Dolphin [12] project of the Science Faculty of the University of
Porto, four dual Pentium II, 300 MHz processors, interconnected by a Myrinet
network, were used. The operating system was Linux Red Hat 6.0.



Fig. 3. The parallel algorithm (flowchart with one column for the master, process p0, and one for processes p1, ..., pn-1)

Fig. 4. The TimeStep routine (flowchart with one column for the master, process p0, and one for processes p1, ..., pn-1)



Besides validation of the results, the possibility of using these different
computation environments enabled us to identify problems in preserving efficiency
for heterogeneous clusters [13]. A comparison of Fig. 5 and Fig. 6, which show the
monitoring output of the public-domain Jumpshot utility, already reveals these
problems. These figures show the inter-process communications for the
heterogeneous cluster, with five processors, and for the homogeneous cluster, with six
processors, both for a drop volume discretization of 20 classes. The black blocks
represent the time consumed by the dXdt routine, and the gray blocks refer to the
TimeStep routine. The white arrows represent the stage exchanges between the
processes.

On the homogeneous cluster, with a drop volume discretization of 100 classes, the
results obtained with six processors showed speedups exceeding a factor of four
(Fig. 7). This result, for a realistic problem dimension, indicates that parallelization
pays off for the intended application [13].

Fig. 5. Jumpshot result for the heterogeneous cluster

Fig. 6. Jumpshot result for the homogeneous cluster

Fig. 7. Speedup for 100 classes, with the cluster of project Dolphin

4. Results and Discussion

With the future application of such a parallel program in industry in mind, a
homogeneous dedicated cluster was selected. It is important to stress, at this point,
that MPI does not respond dynamically to the potential inefficiencies caused by
non-uniform computing speeds of the cluster nodes and the variability of shared
resources [14].

The program was executed on the Beowulf Cluster of the Engineering Faculty of 
the University of Porto. The present configuration of this cluster of commodity PCs is 
one front-end node and twenty-two computing nodes. The front-end is a dual 
Pentium III 550 MHz processor, with 512 MB of memory and 18 GB of disk. Each 
computing node is a single 450 MHz Pentium III, with 128 MB of memory and 6 GB 
of disk. The nodes are connected using a Fast Ethernet BayNetworks 450-24 port 
switch. The operating system is Linux Slackware 7.0 [15]. The results obtained are 
presented in Fig. 8 and Fig. 9. 

These results show speedups exceeding a factor of six, with eighteen processors,
for a drop volume discretization of 100 classes. It can be observed that speedup,
although increasing, shows some plateaus. For instance, between nine and eleven
processors, speedup stabilizes and goes up again for twelve processors. Notice that
nine and twelve divide thirty-six, which is the number of stages of the column. From
twelve to seventeen processors we again have a plateau, and another at a higher level,
from eighteen to twenty-two processors. Eighteen also divides thirty-six. These
performance leaps are related to the way in which the work is distributed among the
various processors. First, when the number of processors divides the number of
stages, the workload is equally distributed. Second, as processors are added,
granularity decreases: the calculation time per processor decreases while the
communication time increases.
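
The position of these plateaus follows from counting the stages handled by the most heavily loaded process. The sketch below assumes a contiguous block distribution of the 36 stages and neglects communication entirely; the function names are ours.

```python
import math


def max_stages_per_process(n_stages, n_procs):
    """Stages handled by the most heavily loaded process in a block split."""
    return math.ceil(n_stages / n_procs)


def ideal_speedup(n_stages, n_procs):
    """Speedup if communication were free and the busiest process set the time."""
    return n_stages / max_stages_per_process(n_stages, n_procs)
```

With 36 stages, the busiest process handles four stages for nine to eleven processes, three for twelve to seventeen, and two for eighteen to twenty-two, which reproduces the intervals over which the measured speedup stays flat.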




Fig. 8. Elapsed time for 100 classes 

The speedup results for different discretizations of the drop phase-space, 50 and
100 drop volume classes, are shown in Fig. 9. For twenty-two processors and 300
time-steps, the results show a speedup increase from 3.71 to 5.97, the higher value
corresponding to the finer discretization. With 100 drop volume discretization
classes and four processors, simulation is already faster than the real process.



Fig. 9. Speedup for 50 and 100 classes (speedup versus number of processors)



5. Conclusions and Future Work 

The application that motivated this work was the simulation of the dynamic behaviour
of liquid-liquid agitated columns. Execution times associated with sequential
algorithms previously published by some of the authors need to be improved before
they can be considered for real-time plant control applications.

Clustered systems, using commodity processors and standard Ethernet networks,
are increasingly popular, owing to their low price/performance ratio.

We have shown that PC clusters are well suited for the intended application. The 
results presented in section 4 lead to the conclusion that parallelization pays off for 
the numerical technique used, based upon a space-time discretization and a stepping 
procedure, with explicit calculation of time derivatives. The fact that the speedup 
increases with the problem size is an important result for the future work, because 
mass transfer simulations involve much heavier calculations than the hydrodynamics. 

Extensions of the algorithm to include mass transfer are presently under 
development, as well as studies concerning the optimization of the drop interaction 
constants and transport parameters. 

In this version of the parallel program, the master is responsible for all global
calculations, besides its own stage calculations, as a separate process. With this
approach, all the communications are made only between the master and the other
processes. Work is in progress to test another methodology, where all processes take
part in the global calculations, implying communication between process i and
process i-1, instead of all process communications being with the master. This
solution takes work from the master but increases communication between the
processes. The analysis of the results will show whether, with this other
communication and work distribution scheme, speedup can be further improved.

Abstract. In this paper, a general introduction to the large-eddy simulation
(LES) technique will be given. Modeling and numerical issues
that are under study will be described to illustrate the capabilities and
requirements of this technique. A palette of applications will then be
presented, chosen on the basis both of their scientific and technological
importance, and to highlight the application of LES on a range of
machines, with widely different computational capabilities.



1 Introduction 

Turbulent flows are ubiquitous in nature and in technological applications. They 
occur in such diverse fields as meteorology, astrophysics, aerospace, mechanical, 
chemical and environmental engineering. For this reason, turbulence has been 
the object of study for many centuries. In 1510, Leonardo da Vinci accompanied 
a drawing of the vortices shed behind a blunt obstacle (Fig. 1) with the following 
observation: 

Observe the motion of the water surface, which resembles that of hair, 
that has two motions: one due to the weight of the shaft, the other to the 
shape of the curls; thus, water has eddying motions, one part of which is 
due to the principal current, the other to the random and reverse motion. 

Despite its importance, and the number of researchers that have studied it the- 
oretically, experimentally and, recently, numerically, turbulence remains one of 
the open problems in Mechanics. 

The equations that govern turbulent flows are the Navier-Stokes equations. 
For turbulent flows, no exact solutions are available, and their numerical solution 
is made difficult by the fact that an accurate calculation depends critically on 
the accurate representation, in space and time, of the coherent fluid structures

Fig. 1. Sketch from Leonardo da Vinci’s notebooks.



(eddies) that govern to a very large extent the transfer of momentum and en- 
ergy. The direct solution of the Navier-Stokes equations (also known as “direct 
numerical simulation”, or DNS) is an extremely expensive endeavor in turbulent 
flows. Its cost depends on the cube of the Reynolds number, the dimensionless 
parameter that measures the relative importance of convective and diffusive ef- 
fects. At present, DNS calculations are limited to flows with Reynolds numbers 
O(IO^), while most engineering and geophysical applications are characterized 
by Re = 0(10® - 10®). 

Practical, predictive, calculations require the use of simplified models. The 
most commonly used one is the solution of the Reynolds-averaged Navier-Stokes 
equations (RANS), in which the flow variables are decomposed into a mean and 
a fluctuating part, as fore-shadowed in da Vinci’s observations, and the effect 
of the turbulent eddies is parameterized globally, through some more-or-less 
complex turbulence model. This technique is widespread in industrial practice, 
but turbulence models are found to require ad hoc adjustments from one flow to 
another, due to the strongly flow-dependent nature of the largest eddies, which 
contribute most to the energy and momentum transfer, and which depend to 
a very significant extent on the boundary conditions. Furthermore, they fail 
to give any information on the wavenumber and frequency distribution of the 
turbulent eddies, which may be important in acoustics, or in problems involving 
the interaction of fluid with solid structures. 

The large-eddy simulation (LES) is a technique intermediate between DNS
and RANS, which relies on computing accurately the dynamics of the large eddies
while modeling the small, subgrid scales of motion. This method is based on the
consideration that, while the large eddies are flow-dependent, the small scales
tend to be more universal, as well as isotropic. Furthermore, they react more
rapidly to perturbations, and recover equilibrium quickly. Thus, the modelling
of the subgrid scales is significantly simpler than that of the large scales, and
can be more accurate.

Despite the fact that the small scales are modeled, LES remains a fairly
computationally intensive technique. Since the motion of the large scales must
be computed accurately in time and space, fine grids (or high-order schemes) and
small time-steps are required. Since the turbulent motions are intrinsically
three-dimensional (3D), even flows that are two- or one-dimensional in the mean must
be computed using a 3D approach. Finally, to accumulate the averaged statistics
needed for the engineering design and analysis, the equations of motion must be
integrated for long times.

As a result of these computational requirements, until recently LES has been
a research tool, used mostly in academic environments and research laboratories 
to study the physics of turbulence. Most calculations were carried out on vector 
machines (Cray X-MP, Y-MP and C90, for instance). Typical computations of 
flows at moderate Reynolds number required up to 1 million grid points, and 
used times of the order of 100 CPU hours and more on such machines. 

Recently, progress has been made on two fronts. First, the development of 
advanced models [4,5] for the small-scale contribution to momentum transfer, 
the subgrid-scale stresses, allows the accurate prediction of the response of the 
small scales even in non-equilibrium situations. Secondly, the decreasing cost
of computational power has made it possible to perform larger simulations on 
a day-to-day basis, even using inexpensive desktop workstations. Simulations 
using 3 million grid points can easily be run on Pentium-based computers. The 
turn-around time for a mixing-layer simulation that used 5 million points on a 
dedicated Alpha processor is of the order of two days per integral scale of the 
flow, a time comparable to what was achievable on a Cray, in which the greater 
processor speed was often offset by the load of the machine, and the end-user was 
frequently restricted to a few CPU hours per day. With the increased availability 
of inexpensive workstation clusters, the application of LES is bound to become
more and more affordable. The use of large, massively parallel computers is, 
however, still required by very advanced, complex applications that may require 
very large numbers of grid points [0(10^)], and correspondingly long integration 
times. 

In this article, a general introduction to LES will be given. Although partic- 
ular emphasis will be placed on numerical issues, the main thrust of the paper 
will not be the algorithmic problems and developments, but rather a discussion 
of the capabilities and computational requirements of this techniques. A palette 
of applications will then be presented, chosen on the basis both of their scientific 
and technological importance, and to highlight the application of LES on a range 
of machines, with widely different computational capabilities. This article should 
not be seen as a comprehensive review of the area; the reader interested in more 
in-depth discussions of the subject is referred to several recent reviews [1,2,3]. 

2 Governing Equations 

The range of scales present in a turbulent flow is a strong function of the Reynolds 
number. Consider for instance the mixing layer shown in Fig. 2. The largest 







Ugo Piomelli, Alberto Scotti, and Elias Balaras 




[Figure 2 labels: large structures, scale $L$; small structures, scale $\eta$.]



Fig. 2. Visualization of the flow in a mixing layer (from Brown & Roshko [6]). 
The flow is from left to right; a splitter plate (immediately to the left of the 
image) separates a high-speed flow (top) from a low-speed one. The two streams 
then mix, forming the large, quasi-2D rollers in the figure, as well as a range of 
smaller scales. 



eddies in this flow are the spanwise rollers, whose scale is $L$; a very wide range 
of smaller scales is present. The energy supplied to the largest turbulent eddies 
by the mean flow is transferred to smaller and smaller scales (energy cascade), 
and eventually dissipated into heat by the smallest ones. Most of the energy, in 
fact, is dissipated by eddies contained in a length scale band of about $6\eta$ to $60\eta$, 
where $\eta$ is the so-called Kolmogorov scale. 

In DNS, all the scales of motion, up to and including the dissipative scales of 
order $\eta$, must be resolved; since the computational domain must be significantly 
larger than the large scale $L$, while the grid size must be of order $\eta$, the number 
of grid points required in each direction is proportional to the ratio $L/\eta$. It can 
be shown that this ratio is proportional to $Re^{3/4}$, where the Reynolds number 
$Re = \Delta U L/\nu$ is based on the velocity difference between the two streams, 
$\Delta U$, and an integral scale of the flow, $L$; $\nu$ is the kinematic viscosity of 
the fluid. Thus, the number of grid points needed to perform a three-dimensional 
DNS scales like the 9/4 power of the Reynolds number. 

The time-scale of the smallest eddies also supplies a bound for the maximum 
time-step allowed: since the ratio of the integral time-scale of the flow to the 
Kolmogorov time-scale is also proportional to $Re^{3/4}$, the number of time-steps 
required to advance the solution by a fixed time has the same dependence on $Re$. 
Assuming that the CPU time required by a numerical algorithm is proportional 
to the total number of points $N$, the cost of a calculation will depend on the 
product of the number of points by the number of time-steps, and hence scales 
like $Re^3$. 
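
The scaling argument above can be turned into a few lines of arithmetic. The sketch below is only illustrative: the function names are hypothetical and all prefactors are set to one, so the absolute numbers are meaningless; only the exponents (9/4 for the grid points, 3/4 for the time-steps) follow the argument in the text.

```python
def dns_grid_points(reynolds):
    """Grid points of a 3-D DNS up to an O(1) factor: (L/eta)^3 ~ Re^(9/4)."""
    return reynolds ** (9.0 / 4.0)


def dns_time_steps(reynolds):
    """Time-steps per integral time scale, up to an O(1) factor: ~ Re^(3/4)."""
    return reynolds ** (3.0 / 4.0)


def dns_cost(reynolds):
    """Relative cost: (number of points) x (number of time-steps) ~ Re^3."""
    return dns_grid_points(reynolds) * dns_time_steps(reynolds)


if __name__ == "__main__":
    base = dns_cost(1.0e4)
    for re in (1.0e4, 2.0e4, 1.0e5):
        # Relative cost with respect to the Re = 1e4 calculation.
        print(f"Re = {re:.0e}: relative cost = {dns_cost(re) / base:.1f}")
```

Doubling the Reynolds number therefore multiplies the estimated cost by a factor of eight, which is why DNS remains confined to low Reynolds numbers.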

In an LES only the large scales of motion must be resolved. The similarity of 
the small scales, which only transmit energy to smaller scales, and the fact that 
the global dissipation level is set by the large scales (even though the dissipation 
takes place at the small-scale level) are exploited by SGS models, whose main 
purpose is to reproduce the energy transfer accurately, at least in a statistical 
sense. When the filter cutoff is in the inertial region of the spectrum (i.e., in 
the wave-number range in which the energy cascade takes place), therefore, the 
resolution required by an LES is nearly independent of the Reynolds number. 
In wall-bounded flows, in which the scale of the large, energy-carrying eddies 
is Reynolds-number-dependent, the situation is less favorable. The cost of an 
LES is still, however, significantly reduced over that of a DNS. 

To separate the large from the small scales, LES is based on the definition of 
a filtering operation: a filtered (or resolved, or large-scale) variable, denoted by 
an overbar, is defined as 

$$\bar{f}(\mathbf{x}) = \int_D f(\mathbf{x}')\, G(\mathbf{x}, \mathbf{x}'; \Delta)\, d\mathbf{x}', \qquad (1)$$

where $D$ is the entire domain, $G$ is the filter function, and $\Delta$, the filter 
width, is a parameter that determines the size of the largest eddy removed by the 
filtering operation. The filter function determines the size and structure of the 
small scales. It is easy to show that, if $G$ is a function of $\mathbf{x} - \mathbf{x}'$ 
only, differentiation and the filtering operation commute [7]. 

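The most commonly-used filter functions are the sharp Fourier cutoff filter, 
best defined in wave space,^1 

$$\hat{G}(k) = \begin{cases} 1 & \text{if } k \le \pi/\Delta, \\ 0 & \text{otherwise,} \end{cases} \qquad (2)$$

the Gaussian filter, 

$$G(x) = \sqrt{\frac{6}{\pi \Delta^2}}\, \exp\left(-\frac{6 x^2}{\Delta^2}\right), \qquad (3)$$

and the top-hat filter in real space: 

$$G(x) = \begin{cases} 1/\Delta & \text{if } |x| \le \Delta/2, \\ 0 & \text{otherwise.} \end{cases} \qquad (4)$$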



For uniform filter width $\Delta$ the filters above are mean-preserving and commute 
with differentiation. 

The effect of filtering a test function with increasing filter-width is shown in 
Fig. 3. Although an increasing range of small scales is removed as $\Delta$ is increased, 
the large-scale structure of the signal is preserved. In RANS, on the other hand, 
the effect of all turbulent eddies would be removed by the averaging procedure. 
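
The behaviour described above can be reproduced with a discrete top-hat filter on a periodic signal. This is a minimal sketch, not the procedure used to produce Fig. 3: the test signal, the helper names and the choice of a nine-point filter are all assumptions made for illustration.

```python
import math


def top_hat_filter(f, width):
    """Periodic discrete top-hat filter: mean of `width` neighbouring samples.

    `width` must be odd so that the stencil is centred on each sample.
    """
    if width % 2 != 1:
        raise ValueError("width must be odd")
    n = len(f)
    half = width // 2
    return [
        sum(f[(i + m) % n] for m in range(-half, half + 1)) / width
        for i in range(n)
    ]


def sine_amplitude(f, wavenumber):
    """Amplitude of the sin(2*pi*wavenumber*i/n) component of a periodic signal."""
    n = len(f)
    return 2.0 / n * sum(
        f[i] * math.sin(2.0 * math.pi * wavenumber * i / n) for i in range(n)
    )


def sample_signal(n=128):
    """Mean value 1, a large-scale wave (wavenumber 1) and a small-scale wave (16)."""
    return [
        1.0
        + math.sin(2.0 * math.pi * i / n)
        + 0.3 * math.sin(2.0 * math.pi * 16 * i / n)
        for i in range(n)
    ]


if __name__ == "__main__":
    raw = sample_signal()
    filtered = top_hat_filter(raw, 9)
    print("mean before/after:", sum(raw) / len(raw), sum(filtered) / len(filtered))
    print("large-scale amplitude after filtering:", sine_amplitude(filtered, 1))
    print("small-scale amplitude after filtering:", sine_amplitude(filtered, 16))
```

The mean is unchanged, the large-scale wave is almost untouched, and the small-scale wave is strongly attenuated, which is the qualitative picture the text describes.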

In LES the filtering operation (1) is applied formally to the governing equa- 
tions; this results in the filtered equations of motion, which are solved in LES. 
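For an incompressible flow of a Newtonian fluid, they take the following form: 

$$\frac{\partial \bar{u}_i}{\partial x_i} = 0, \qquad (5)$$

$$\frac{\partial \bar{u}_i}{\partial t} + \frac{\partial}{\partial x_j}\left(\bar{u}_i \bar{u}_j\right) = -\frac{1}{\rho}\frac{\partial \bar{p}}{\partial x_i} - \frac{\partial \tau_{ij}}{\partial x_j} + \nu \frac{\partial^2 \bar{u}_i}{\partial x_j \partial x_j}. \qquad (6)$$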



^1 A quantity denoted by a caret ($\hat{\ }$) is the complex Fourier coefficient of the 
original quantity. 








Fig. 3. Effect of filtering a test function with increasing filter-width $\Delta$. 



The filtered Navier-Stokes equations written above govern the evolution of the 
large, energy-carrying scales of motion. The effect of the small scales appears 
through a subgrid-scale (SGS) stress term, 

$$\tau_{ij} = \overline{u_i u_j} - \bar{u}_i \bar{u}_j, \qquad (7)$$

that must be modeled to achieve closure of the system of equations. 
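
To make the definition of the SGS stress concrete, the sketch below evaluates its one-dimensional analogue, filter(u u) minus filter(u) filter(u), for a sampled periodic signal. It assumes a top-hat filter and a hypothetical two-wave test signal; in an actual LES the unfiltered field is not available, which is precisely why this term has to be modeled.

```python
import math


def box_filter(f, width):
    """Periodic top-hat filter over an odd number of samples."""
    n = len(f)
    half = width // 2
    return [
        sum(f[(i + m) % n] for m in range(-half, half + 1)) / width
        for i in range(n)
    ]


def sgs_stress(u, width):
    """1-D analogue of the SGS stress: filter(u*u) - filter(u)*filter(u)."""
    u_bar = box_filter(u, width)
    uu_bar = box_filter([v * v for v in u], width)
    return [a - b * b for a, b in zip(uu_bar, u_bar)]


if __name__ == "__main__":
    n = 64
    # Large-scale wave plus a small-scale wave that the filter partly removes.
    u = [
        math.sin(2.0 * math.pi * i / n) + 0.5 * math.sin(2.0 * math.pi * 12 * i / n)
        for i in range(n)
    ]
    tau = sgs_stress(u, 5)
    print("min/max of tau:", min(tau), max(tau))
```

With a positive filter kernel such as the top-hat, this quantity is a local variance of the velocity and is therefore non-negative; it vanishes wherever the signal is uniform over the filter width.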

3 Subgrid-Scale Models 

In LES the dissipative scales of motion are resolved poorly, or not at all. The 
main role of the subgrid-scale model is, therefore, to remove energy from the 
resolved scales, mimicking the drain that is associated with the energy cascade. 
Most subgrid scale models are eddy-viscosity models of the form 

$$\tau_{ij} - \frac{\delta_{ij}}{3}\tau_{kk} = -2 \nu_T \bar{S}_{ij} \qquad (8)$$

that relate the subgrid-scale stresses to the large-scale strain-rate tensor 
$\bar{S}_{ij} = (\partial \bar{u}_i/\partial x_j + \partial \bar{u}_j/\partial x_i)/2$. 
In most cases the equilibrium assumption (namely, that 
the small scales are in equilibrium, and dissipate entirely and instantaneously all 
the energy they receive from the resolved ones) is made to simplify the problem 
further and obtain an algebraic model for the eddy viscosity [8]: 

$$\nu_T = C \Delta^2 |\bar{S}|, \qquad |\bar{S}| = \left(2 \bar{S}_{ij} \bar{S}_{ij}\right)^{1/2}. \qquad (9)$$

This model is known as the “Smagorinsky model”. The value of the coefficient $C$ 
can be determined from isotropic turbulence decay [9]; if the cutoff is in the inertial 
subrange, the Smagorinsky constant $C_s = \sqrt{C}$ takes values between 0.18 and 
0.23 (and $C \simeq 0.032$ to $0.053$). In the presence of shear, near solid boundaries 
or in transitional flows, however, it has been found that $C$ must be decreased. 
This has been accomplished by various types of ad hoc corrections such as van 
Driest damping [10] or intermittency functions [11]. 
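
As an illustration of how the Smagorinsky eddy viscosity is evaluated at a single grid point, the following sketch computes $\nu_T = C \Delta^2 |\bar{S}|$ with $C = C_s^2$ from a given velocity-gradient tensor. The function names, the default value of the constant and the simple-shear example are assumptions chosen for illustration; no near-wall damping is applied.

```python
import math


def strain_rate(grad_u):
    """S_ij = (du_i/dx_j + du_j/dx_i) / 2, with grad_u[i][j] = du_i/dx_j."""
    return [
        [0.5 * (grad_u[i][j] + grad_u[j][i]) for j in range(3)] for i in range(3)
    ]


def strain_magnitude(s):
    """|S| = sqrt(2 S_ij S_ij)."""
    return math.sqrt(2.0 * sum(s[i][j] ** 2 for i in range(3) for j in range(3)))


def smagorinsky_viscosity(grad_u, delta, c_s=0.18):
    """Eddy viscosity nu_T = (c_s * delta)^2 |S|; c_s is the Smagorinsky constant."""
    return (c_s * delta) ** 2 * strain_magnitude(strain_rate(grad_u))


if __name__ == "__main__":
    # Simple shear du/dy = 2 (arbitrary units), filter width 0.1.
    shear = [[0.0, 2.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    print("nu_T =", smagorinsky_viscosity(shear, delta=0.1))
```

For a simple shear the strain magnitude reduces to the shear rate itself, so the eddy viscosity is non-zero even in a laminar shear flow; this is one of the shortcomings that motivates the corrections mentioned in the text.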

More advanced models that do not suffer from the shortcomings of the 
Smagorinsky model (excessive dissipation, incorrect asymptotic behavior near 
solid surfaces, need to adjust the constant in regions of laminar flow or high 
shear) have been developed recently. The introduction of dynamic modeling 
ideas [4] has spurred significant progress in the subgrid-scale modeling of 
non-equilibrium flows. In dynamic models the coefficient(s) of the model are 
determined as the calculation progresses, based on the energy content of the 
smallest resolved scale, rather than input a priori as in the standard 
Smagorinsky [8] model. A modification of this model was proposed by Meneveau 
et al. [5], which has been shown to give accurate results in non-equilibrium 
flows in which other models fail [12]. 

Turbulence theory (in particular the Eddy-Damped Quasi-Normal Markovian 
theory) has also been successful in aiding the development of SGS models. The 
Chollet-Lesieur [13,14] model, as well as the structure-function [15] and filtered- 
structure-function models [16] have been applied with some success to several 
flows. 

A detailed discussion of SGS models is beyond the scope of this paper. The 
interested reader is referred to the review articles referenced above [1,2,3]. 

4 Numerical Methods 

In large-eddy simulations the governing equations (5-6) are discretized and solved 
numerically. Although only the large scales of motion are resolved, the range of 
scales present is still significant. In this section, a brief overview of the numerical 
requirements of LES will be given. 



4.1 Time Advancement 

The choice of the time advancement method is usually determined by the 
requirements that numerical stability be assured, and that the turbulent motions 
be accurately resolved in time. Two stability limits apply to large-eddy 
simulations. The first is the viscous condition, which requires that the time-step 
$\Delta t$ be less than $\Delta t_v = \sigma \Delta y^2/\nu$ (where $\sigma$ depends on the 
time advancement chosen). The CFL condition requires that $\Delta t$ be less than 
$\Delta t_c = \mathrm{CFL}\, \Delta x/u$, where the maximum allowable Courant number 
CFL also depends on the numerical scheme used. Finally, the physical constraint 
requires $\Delta t$ to be less than the time scale of the smallest resolved scale of 
motion, $\tau \sim \Delta x/U_c$ (where $U_c$ is a convective velocity of the same 
order as the outer velocity). 
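
The three constraints can be compared directly once the grid spacings and velocity scales are known. The sketch below is a hedged illustration: the function name, the returned dictionary and all numerical values in the example are hypothetical, and the scheme-dependent constants (`sigma`, `cfl_max`) must be supplied by the user for the time-advancement method actually employed.

```python
def time_step_limits(dx, dy, u_max, u_c, nu, sigma, cfl_max):
    """Return the viscous, CFL and physical time-step bounds.

    dx, dy  : streamwise and wall-normal grid spacings
    u_max   : largest advecting velocity on the grid
    u_c     : convective velocity of the smallest resolved eddies
    nu      : kinematic viscosity
    sigma   : scheme-dependent constant of the viscous limit
    cfl_max : maximum allowable Courant number of the scheme
    """
    return {
        "viscous": sigma * dy ** 2 / nu,
        "cfl": cfl_max * dx / u_max,
        "physical": dx / u_c,
    }


def limiting_constraint(limits):
    """Name and value of the most restrictive bound."""
    name = min(limits, key=limits.get)
    return name, limits[name]


if __name__ == "__main__":
    # Hypothetical wall-bounded case: very fine wall-normal spacing.
    limits = time_step_limits(
        dx=0.1, dy=1.0e-3, u_max=1.0, u_c=1.0, nu=1.0e-3, sigma=0.5, cfl_max=0.5
    )
    print(limits, limiting_constraint(limits))
```

With a wall-normal spacing much finer than the streamwise one, the viscous bound is far smaller than the other two in this example, which is the situation that motivates an implicit treatment of the diffusive terms.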

In many cases (especially in wall-bounded flows, and at low Reynolds 
numbers), the viscous condition demands a much smaller time-step than the other 
two; for this reason, the diffusive terms of the governing equations are often 
advanced using implicit schemes (typically, the second-order Crank-Nicolson 
scheme). Since, however, $\Delta t_c$ and $\tau$ are of the same order of magnitude, the 
convective term can be advanced by explicit schemes such as the second-order 
Adams-Bashforth method, or third- or fourth-order Runge-Kutta schemes. 




Fig. 4. Modified wave-number for various differencing schemes. 



4.2 Spatial Discretization 

The analytical derivative of a complex exponential $f(x) = e^{ikx}$ is 
$f'(x) = ik e^{ikx}$; if $f$ is differentiated numerically, however, the result is 

$$\frac{\delta f}{\delta x} = i k' e^{ikx}, \qquad (10)$$

where $k'$ is the “modified wave-number”. A modified wave-number corresponds 
to each differencing scheme. Its real part represents the attenuation of the 
computed derivative compared to the actual one, whereas a non-zero imaginary part 
of $k'$ indicates that phase errors are introduced by the numerical differentiation. 
Figure 4 shows the real part of the modified wave-numbers for various schemes. 
For a second-order centered scheme, for instance, $k' = k \sin(k \Delta x)/(k \Delta x)$. 
For small wave-numbers $k$ the numerical derivative is quite accurate; high 
wave-number fluctuations, however, are resolved poorly. No phase errors are 
introduced. 
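
The modified wave-number is easy to evaluate numerically. The sketch below gives the closed-form expressions for the second-order centered scheme quoted above and, as an added assumption not taken from the text, for the standard fourth-order centered five-point stencil; a third helper measures $k'$ by applying the second-order stencil to $e^{ikx}$ directly. All function names are hypothetical.

```python
import cmath
import math


def modified_wavenumber_2nd(k, dx):
    """k' for (f[j+1] - f[j-1]) / (2 dx): sin(k dx) / dx."""
    return math.sin(k * dx) / dx


def modified_wavenumber_4th(k, dx):
    """k' for (-f[j+2] + 8 f[j+1] - 8 f[j-1] + f[j-2]) / (12 dx)."""
    return (8.0 * math.sin(k * dx) - math.sin(2.0 * k * dx)) / (6.0 * dx)


def measured_wavenumber_2nd(k, dx):
    """k' obtained by differencing exp(i k x) at x = 0 with the 2nd-order stencil."""
    derivative = (cmath.exp(1j * k * dx) - cmath.exp(-1j * k * dx)) / (2.0 * dx)
    return derivative / 1j  # f(0) = 1, so the derivative equals i * k'


if __name__ == "__main__":
    dx = 1.0
    for fraction in (0.1, 0.25, 0.5, 0.75):
        k = fraction * math.pi / dx  # fraction of the grid cutoff wave-number
        print(
            f"k dx / pi = {fraction:4.2f}: "
            f"k'/k (2nd) = {modified_wavenumber_2nd(k, dx) / k:.3f}, "
            f"k'/k (4th) = {modified_wavenumber_4th(k, dx) / k:.3f}"
        )
```

Both values of $k'$ are real (no phase error for centered schemes), and the higher-order stencil stays close to the exact value $k' = k$ over a wider range of wave-numbers.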

The need to resolve accurately high wave-number turbulent fluctuations 
implies that either low-order schemes are used on very fine meshes (such that, for 
the smallest scales that are physically important, $k' \simeq k$), or that higher-order 
schemes are employed on coarser meshes. High-order schemes are more 
expensive, in terms of computational resources, than low-order ones, but the increase 
in accuracy they afford (for a given mesh) often justifies their use. 

4.3 Conservation 



It is particularly important, in large-eddy simulations of transitional and 
turbulent flows, that the numerical scheme preserves the conservation properties of 
the Navier-Stokes equations. In the limit $Re \to \infty$, the Navier-Stokes equations 
conserve mass, momentum, energy and vorticity in the interior of the flow: the 
integral of these quantities over the computational domain can only be affected 
through the boundaries. Some numerical schemes, however, do not preserve this 
property. For instance, the convective term in the momentum equations can be 
cast in several ways: 



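Advective form: $\bar{u}_j \dfrac{\partial \bar{u}_i}{\partial x_j}$, (11) 

Divergence form: $\dfrac{\partial}{\partial x_j}\left(\bar{u}_i \bar{u}_j\right)$, (12) 

Rotational form: $\epsilon_{ijk}\, \bar{\omega}_j \bar{u}_k + \dfrac{\partial}{\partial x_i}\left(\tfrac{1}{2} \bar{u}_j \bar{u}_j\right)$, (13) 

Skew-symmetric form: $\dfrac{1}{2}\left[\bar{u}_j \dfrac{\partial \bar{u}_i}{\partial x_j} + \dfrac{\partial}{\partial x_j}\left(\bar{u}_i \bar{u}_j\right)\right]$, (14) 

where $\bar{\omega}_k = \epsilon_{kij}\, \partial \bar{u}_j/\partial x_i$. It is easy to show (Morinishi et al. [17]) that, if a typical 
co-located finite-difference scheme is used, the first form does not conserve either 
momentum or energy, the second conserves momentum but not energy, the others 
conserve both. If, on the other hand, a control-volume approach is used, the 
divergence form conserves energy but the pressure-gradient term does not. With 
a staggered grid, the divergence form preserves the conservation properties of the 
Navier-Stokes equations if central, second-order accurate differences are used. 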
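
The difference between these forms can be illustrated on a much simpler problem than the Navier-Stokes equations. The sketch below is a scalar analogue, not the analysis of Morinishi et al. [17]: a scalar `phi` is advected by a prescribed periodic velocity `c` on a uniform one-dimensional grid, using second-order central differences on a co-located grid. The fields and function names are assumptions chosen for illustration. The sum of the convective term over the grid plays the role of the "momentum" budget, and the sum of `phi` times the convective term plays the role of the "energy" budget.

```python
import math


def ddx(f, h):
    """Second-order central difference on a periodic grid."""
    n = len(f)
    return [(f[(i + 1) % n] - f[i - 1]) / (2.0 * h) for i in range(n)]


def advective_form(c, phi, h):
    """c * d(phi)/dx."""
    return [ci * d for ci, d in zip(c, ddx(phi, h))]


def divergence_form(c, phi, h):
    """d(c * phi)/dx."""
    return ddx([ci * p for ci, p in zip(c, phi)], h)


def skew_symmetric_form(c, phi, h):
    """Average of the advective and divergence forms."""
    adv = advective_form(c, phi, h)
    div = divergence_form(c, phi, h)
    return [0.5 * (a + d) for a, d in zip(adv, div)]


def energy_rate(phi, term):
    """Discrete analogue of the integral of phi times the convective term."""
    return sum(p * t for p, t in zip(phi, term))


def sample_fields(n=32):
    """Hypothetical smooth periodic velocity and scalar fields."""
    h = 2.0 * math.pi / n
    x = [i * h for i in range(n)]
    c = [1.0 + 0.5 * math.sin(xi) for xi in x]
    phi = [1.0 + math.cos(xi) for xi in x]
    return c, phi, h


if __name__ == "__main__":
    c, phi, h = sample_fields()
    for name, form in (
        ("advective", advective_form),
        ("divergence", divergence_form),
        ("skew-symmetric", skew_symmetric_form),
    ):
        term = form(c, phi, h)
        print(f"{name:15s} sum = {sum(term): .3e}  energy rate = {energy_rate(phi, term): .3e}")
```

In this analogue the divergence form sums to zero over the periodic grid (the fluxes telescope), while only the skew-symmetric form makes the quadratic budget vanish to round-off, independently of the velocity field. The full Navier-Stokes case involves the discrete continuity equation as well and is treated in the reference cited above.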

Upwind schemes also have very undesirable effects on the conservation prop- 
erties of the calculation, as does the explicit addition of artificial dissipation. 
Even mildly upwind-biased schemes result in a significant loss of accuracy. These 
methods are not suited to LES of incompressible flows, and should be avoided. 



4.4 Complex Geometries 

For applications to complex geometries, single-block, Cartesian meshes are inad- 
equate, since they do not give the required flexibility. One alternative is the use 
of body-fitted curvilinear grids. LES codes in generalized coordinates have been 
used, among others, by Zang et al. [21,22] (who applied them to a Cartesian 
geometry, the lid-driven cavity [21], and to the study of coastal up-welling [22,23]), 
Beaudan and Moin [24] and Jordan [25]. Jordan [25] examined the issue of 
filtering in curvilinear coordinates, and concluded that filtering the transformed 
equations (in generalized coordinates) directly in the computational space is 
better than performing the filtering either of the transformed equations in real 
space, or of the untransformed equations in Cartesian space. 

Even if curvilinear grids are used, the application of LES to complex 
geometries might be limited by resolution requirements. In the presence of a solid 
boundary, for instance, a very fine mesh is required to resolve the wall layer. 
Kravchenko et al. [26] used zonal embedded meshes and a numerical method 
based on B-splines to compute the flow in a two-dimensional channel, and around 
a circular cylinder. The use of B-splines allows an arbitrarily high order 
of accuracy for the differentiation, and accurate interpolation at the interface 
between the zones. A typical grid for the channel flow simulations is shown in 
Fig. 5, which shows the different spanwise resolution in the various layers, 
in addition to the traditional stretching in the wall-normal direction. The use of 
zonal grids allowed Kravchenko et al. [26] to increase the Reynolds number of the 
calculations substantially: they performed an LES of the flow at $Re_c = 109\,410$, 
in which the use of 9 embedded zones allowed them to resolve the wall layer using 
a total of 2 million points. A single-zone mesh with the same resolution would have 
under-resolved the wall layer severely. The mean velocity profile was in excellent 
agreement with the experimental data. 




Fig. 5. Zonal embedded grid with fine grid zones near the walls and coarse 
zones in the middle of the channel. The flow is into the paper. Reproduced with 
permission from Kravchenko et al. [26] 



Very few applications of LES on unstructured meshes have been reported to 
date. Jansen [29] showed results for isotropic turbulence and plane channel. For 
the plane channel, the results were in fair agreement with DNS data (the peak 
streamwise turbulence intensity, for instance, was 15% higher than that obtained 
in the DNS), but slightly better than the results of finite-difference calculations 
on the same mesh. Simulations of the flow over a low-Reynolds number airfoil 
using this method [28] were in fair agreement with experimental data. Knight 
et al. [30] computed isotropic turbulence decay using tetrahedral meshes, and 
compared the Smagorinsky model with results obtained relying on the numerical 
dissipation to drain energy from the large scales. They found that the inclusion 
of an SGS model gave improved results. 



Large-Eddy Simulations of Turbulent Flows 561 



While high-order schemes can be applied fairly easily in simple geometries, 
in complex configurations their use is rather difficult. Present applications of 
LES to relatively complex flows, therefore, tend to use second-order schemes; the 
increasing use of LES on body-fitted grids for applications to flows of engineering 
interest indicates that, at least in the immediate future, second-order accurate 
schemes are going to increase their popularity, at the expense of the spectral 
methods that have been used frequently in the past. Explicit filtering of the 
governing equations, with filter widths larger than the grid size, may be required 
in such circumstances. 



5 Applications: Flow in an Accelerating Boundary Layer 

A boundary layer is the region of fluid flow nearest to a solid body, in which 
viscous effects (i.e., diffusion) are important. Turbulent boundary layers occur in 
many technological applications, and are often subjected to favorable pressure 
gradients that result in an acceleration of the velocity at the edge of the boundary 
layer, the free-stream velocity. Figure 6 illustrates schematically the boundary 
layer that occurs at the leading edge of an airplane wing. The fluid is accelerated 
as it turns over the top side of the airfoil from the stagnation point, where its 
velocity is zero. 




Fig. 6. Sketch of the flow near the leading edge of an airfoil. 



Despite the importance of this type of flow field, however, it is not as 
well understood as the canonical zero-pressure-gradient boundary layer, due to 
the much wider parameter space, and to the difficulty in determining universal 
scaling laws similar to those for the zero-pressure-gradient case. In fact, a large 
percentage of the investigations of accelerating flows to date have concentrated 
on self-similar cases, in which such scaling laws can be found. 

It is recognized that, if the acceleration is sufficiently strong, turbulence 
cannot be sustained. In self-similar accelerating boundary layers, this phenomenon 
takes place when the acceleration parameter $K$ reaches a critical value: 

$$K = \frac{\nu}{U_\infty^2}\frac{dU_\infty}{dx} = -\frac{\nu}{\rho U_\infty^3}\frac{dP_\infty}{dx} \simeq 3 \times 10^{-6}. \qquad (15)$$

The RANS approach, which is often used in aeronautical applications, has 
difficulty dealing with the reversion of a turbulent flow to a laminar one, and with 
the re-transition of the flow, which becomes turbulent again as the acceleration 
ceases on the suction (upper) side of the airfoil. Large-eddy simulation can help 
in understanding the physics that cause reversion and re-transition, as well as 
provide accurate data that can be used for the development and validation of 
lower-level RANS models to be used in engineering design. 

In particular, experimental evidence indicates that the dynamics of the 
coherent eddies play an important role in the reversion. An improved understanding 
of the dynamics of these eddies in boundary layers subjected to a favorable 
pressure gradient would be extremely beneficial. Apart from the considerations 
about momentum transfer and mixing also valid in other flows, an additional 
motivating factor is provided here by the consideration that most of the theoretical 
attempts to derive scaling laws are based on multiple-scale approximations 
that assume little or no interaction between inner and outer layers. The most 
direct way to establish the validity of this assumption is by studying the coherent 
eddies in the wall layer. Unlike RANS solutions, in which only the average 
flow-field is computed, LES can supply information on the behavior of the coherent 
structures. 

Piomelli and co-workers [31] studied the velocity fields obtained from the 
large-eddy simulation (LES) of accelerating boundary layers with the aim of 
improving the understanding of the dynamics of the coherent vortices in 
relaminarizing flows. To separate the effect of the pressure gradient from that of 
curvature, the calculation of the boundary layer on a flat plate with an 
accelerating free-stream was carried out; the configuration is similar to the flow on 
the lower wall of a wind-tunnel in which the upper wall converges, as sketched 
in Fig. 7. The computational domain is the shaded area in the figure. 




Two computations were examined: one in which the acceleration is relatively 
mild (the maximum velocity increases by 35% over the computational domain, 
and $K < 3 \times 10^{-6}$ everywhere), and a strong-acceleration case in which the 
velocity increases by almost 150%, and $K > 3 \times 10^{-6}$ for a significant portion of 
the flow. The modification of the turbulence structure in accelerating flows was 
emphasized, and it was shown how the acceleration can be associated with lower 
turbulence levels and with the dynamics of the quasi-streamwise coherent vortices. 
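
The acceleration parameter that distinguishes the two cases is straightforward to evaluate from the free-stream velocity distribution. The sketch below uses the definition $K = (\nu/U_\infty^2)\, dU_\infty/dx$ and the critical value quoted in the text; the function names and the numerical values in the example are hypothetical and are not taken from the simulations discussed here.

```python
K_CRITICAL = 3.0e-6  # critical value of the acceleration parameter quoted in the text


def acceleration_parameter(u_inf, du_inf_dx, nu):
    """K = (nu / U_inf^2) * dU_inf/dx."""
    return nu * du_inf_dx / u_inf ** 2


def exceeds_critical(u_inf, du_inf_dx, nu, k_critical=K_CRITICAL):
    """True where the local acceleration is strong enough to favor relaminarization."""
    return acceleration_parameter(u_inf, du_inf_dx, nu) > k_critical


if __name__ == "__main__":
    nu = 1.5e-5  # kinematic viscosity of air in m^2/s (approximate, for illustration)
    for u_inf, du_dx in ((10.0, 2.0), (1.0, 1.0)):
        k = acceleration_parameter(u_inf, du_dx, nu)
        print(f"U = {u_inf} m/s, dU/dx = {du_dx} 1/s: K = {k:.2e}, "
              f"above critical: {exceeds_critical(u_inf, du_dx, nu)}")
```

Because $K$ is a local quantity, exceeding the critical value over a short streamwise distance does not by itself guarantee full relaminarization, as the results discussed later in this section show.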



5.1 Numerical Method 



The governing equations (5-6) are integrated numerically using the fractional 
time-step method [18,19], in which first the Helmholtz equation is solved to 
obtain an estimate of the velocity field that does not satisfy mass conservation; the 
pressure is then computed by solving Poisson’s equation, the estimated velocity 
field supplying the source term. When a pressure correction is applied, the 
resulting velocity will be a divergence-free solution of the Navier-Stokes equations. 
If the Navier-Stokes equations are written as 

$$\frac{\partial \bar{u}_i}{\partial t} = -\frac{\partial \bar{p}}{\partial x_i} - H_i + \nu \nabla^2 \bar{u}_i, \qquad (16)$$

where $H_i$ contains the nonlinear term and the SGS stresses, the time-advancement 
sequence based on the second-order-accurate Adams-Bashforth method consists 
of the following steps: 

1. Velocity prediction (Helmholtz equation): 

$$v_j - \bar{u}_j^{n} = \Delta t \left[\frac{3}{2}\left(-H_j^{n} + \nu \nabla^2 \bar{u}_j^{n}\right) - \frac{1}{2}\left(-H_j^{n-1} + \nu \nabla^2 \bar{u}_j^{n-1}\right)\right]; \qquad (17)$$

2. Poisson solution: 

$$\nabla^2 \bar{p} = \frac{1}{\Delta t}\frac{\partial v_i}{\partial x_i}; \qquad (18)$$

3. Velocity correction: 

$$\bar{u}_j^{n+1} = v_j - \Delta t \frac{\partial \bar{p}}{\partial x_j}. \qquad (19)$$

$v_j$ is the estimated velocity. This time-advancement scheme is 
second-order-accurate in time. The code uses central differences on a staggered mesh, and is 
second-order accurate in space as well. Discretization of the Poisson equation 
(18) results in a hepta-diagonal matrix that can be solved directly if the grid 
is uniform in at least one direction. 
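
The 3/2 and 1/2 weights that appear in the velocity-prediction step are those of the second-order Adams-Bashforth method. The sketch below isolates that ingredient on a scalar model problem, $du/dt = -u$; it is not the authors' solver and it omits the pressure projection entirely. Since the text does not say how the very first step is taken (no right-hand side from a previous step exists), a second-order Heun step is assumed here for the start-up.

```python
import math


def adams_bashforth_2(rhs, u0, dt, n_steps):
    """Advance du/dt = rhs(u) over n_steps steps of size dt.

    The first step uses Heun's method (an assumption); all later steps use
    u_new = u + dt * (3/2 * rhs(u) - 1/2 * rhs(u_previous)).
    """
    f_prev = rhs(u0)
    u = u0 + 0.5 * dt * (f_prev + rhs(u0 + dt * f_prev))  # start-up step
    for _ in range(n_steps - 1):
        f = rhs(u)
        u = u + dt * (1.5 * f - 0.5 * f_prev)
        f_prev = f
    return u


def decay_error(dt, n_steps):
    """Error at t = n_steps * dt for du/dt = -u, u(0) = 1."""
    numerical = adams_bashforth_2(lambda u: -u, 1.0, dt, n_steps)
    return abs(numerical - math.exp(-dt * n_steps))


if __name__ == "__main__":
    coarse = decay_error(0.01, 100)
    fine = decay_error(0.005, 200)
    print("error ratio when the time-step is halved:", coarse / fine)
```

Halving the time-step reduces the error by a factor close to four, which is the behaviour expected of a second-order-accurate scheme.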

The calculations were performed on a domain of size $400 \times 25 \times 25$. All lengths 
are normalized with respect to the inflow displacement thickness $\delta^*$; the 
displacement thickness is an integral length scale defined as 

$$\delta^* = \int_0^\infty \left(1 - \frac{U}{U_\infty}\right) dy, \qquad (20)$$

where $U$ is the average streamwise velocity. The calculations used $256 \times 48 \times 64$ 
grid points. A grid-refinement study was performed in the strong-acceleration 
case, in which the number of grid points was increased by 50% in each direction. 
In the accelerating-flow region ($x/\delta^* < 320$) the results on the coarser mesh 
matched very well those obtained with the finer one. In the re-transitioning 
area, the qualitative behaviour of the flow was captured correctly, but some 
differences (of the order of 15%) were observed in the statistical quantities. The 
Lagrangian dynamic eddy viscosity model [5] was used to parameterize the SGS 
stresses. 

The cost of the computations was 2.2 x 10“^ CPU seconds per time-step 
and grid point on a 300 MHz Pentium II running Linux. Out of this CPU time, 
37% is devoted to the computation of the RHS, 25% to the computation of the 
turbulent viscosity, 12% to solve the Poisson equation, and 10% to update the 
velocity field and impose boundary conditions. The rest of the CPU time is consumed 
by I/O and computation of statistical quantities. Typically a computation on a 
$10^6$-point grid requires approximately 42 hours of CPU time to obtain converged 
statistics (sampling over 1.5 flow-through times). It is interesting to observe that 
the cost of solving the Poisson equation is a small fraction of the total cost when a 
direct solver (as in the present case) is used. Any other choice of solution method, 
such as multigrid or conjugate gradient methods, would substantially 
increase the cost of this step, which can then account for a large fraction of the total 
cost, depending on the problem and the computational grid. 



5.2 Results 



The free-stream velocity obtained from the calculation, $U_\infty$, the pressure 
parameter $K$ and the momentum-thickness Reynolds number, $Re_\theta = \theta U_\infty/\nu$, 
where $\theta$ is the momentum thickness defined as 

$$\theta = \int_0^\infty \frac{U}{U_\infty}\left(1 - \frac{U}{U_\infty}\right) dy, \qquad (21)$$

are shown in Fig. 8 for the two cases examined. In the strong acceleration case, 
despite the presence of a fairly extended region in which $K$ exceeds $3 \times 10^{-6}$, 
the Reynolds number never goes below the critical value $Re_\theta \simeq 350$. Thus one 
would expect the flow to become less turbulent, but not to revert fully into a 
laminar one. 
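
The integral quantities used in this section can be computed from a mean velocity profile by simple quadrature. The sketch below uses the standard definitions of the displacement thickness, $\delta^* = \int_0^\infty (1 - U/U_\infty)\,dy$, and of the momentum thickness given above, with their ratio $H = \delta^*/\theta$ as the shape factor. The trapezoidal rule, the function names and the exponential test profile are assumptions made for illustration; they are not the profiles or the post-processing of the simulations.

```python
import math


def trapezoid(y, f):
    """Trapezoidal rule on a (possibly non-uniform) grid."""
    return sum(0.5 * (f[i] + f[i + 1]) * (y[i + 1] - y[i]) for i in range(len(y) - 1))


def displacement_thickness(y, u, u_inf):
    """delta* = integral of (1 - U/U_inf) dy."""
    return trapezoid(y, [1.0 - ui / u_inf for ui in u])


def momentum_thickness(y, u, u_inf):
    """theta = integral of (U/U_inf) (1 - U/U_inf) dy."""
    return trapezoid(y, [(ui / u_inf) * (1.0 - ui / u_inf) for ui in u])


def shape_factor(y, u, u_inf):
    """H = delta* / theta."""
    return displacement_thickness(y, u, u_inf) / momentum_thickness(y, u, u_inf)


def momentum_thickness_reynolds(y, u, u_inf, nu):
    """Re_theta = theta * U_inf / nu."""
    return momentum_thickness(y, u, u_inf) * u_inf / nu


if __name__ == "__main__":
    # Hypothetical profile U/U_inf = 1 - exp(-y), for which delta* = 1 and theta = 1/2.
    y = [0.01 * i for i in range(2001)]
    u = [1.0 - math.exp(-yi) for yi in y]
    print("delta* =", displacement_thickness(y, u, 1.0))
    print("theta  =", momentum_thickness(y, u, 1.0))
    print("H      =", shape_factor(y, u, 1.0))
```

The upper integration limit must lie well outside the boundary layer, where the integrands have decayed to zero; otherwise both thicknesses are underestimated.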

The streamwise development of several time-averaged and integral quantities 
is shown in Fig. 9. As a result of the free-stream acceleration, the boundary layer 
becomes thinner, as shown by the distributions of $\delta^*$ and $\theta$. The skin friction 
coefficient based on the local free-stream velocity, $C_f = 2\tau_w/(\rho U_\infty^2)$ (where $\tau_w$ 
is the wall stress), initially increases, but, as the flow begins to relaminarize, it 
decreases in both the mild- and strong-acceleration cases. 

Although the pressure-gradient parameter $K$ is well above the critical value 
of $3 \times 10^{-6}$ in the strongly accelerating case, the acceleration is not sustained 
long enough for the Reynolds number to be reduced below the critical value, 
$Re_\theta \simeq 350$. Thus, full relaminarization does not occur; the shape factor $H$ only 
reaches a value of 1.6 (the shape factor associated with the laminar 
Falkner-Skan similarity profile for sink flows of this type is 2.24). The mean velocity 
profile, however, is significantly affected by the acceleration, even in the mild 
acceleration case. 

Fig. 8. Spatial development of the free-stream velocity $U_\infty$, the acceleration 
parameter $K$, and the momentum-thickness Reynolds number $Re_\theta$ in the 
accelerating boundary layer. 




Fig. 9. Spatial development of mean quantities in the accelerating boundary 
layer. (a) Displacement thickness $\delta^*$; (b) shape factor $H$; (c) skin-friction 
coefficient $C_f$. 

Fig. 10. Contours of the turbulent kinetic energy, normalized by the free-stream 
kinetic energy in the accelerating boundary layer. 



As the flow is gradually accelerated, the turbulence adjusts to the perturba- 
tion; the turbulent quantities, however, lag the mean flow. The turbulent kinetic 
energy, for instance, increases in absolute levels, although not as fast as the ki- 
netic energy of the mean flow. Thus, the contours of the turbulent kinetic energy 
normalized by the kinetic energy of the mean flow, shown in Fig. 10, highlight 
a significant drop in the turbulent kinetic energy in the region of acceleration. 

Paradoxically, in many turbulent flows, whenever energy is added through 
the mean flow, the energy of the turbulence initially decreases, as the coherent 
vortices adapt to the perturbation. This process often involves disruption of the 
vortical structures prior to their re-generation. Such is the case in this config- 
uration as well: the vortical structures are visualized in Fig. 11 as isosurfaces 
of the second invariant of the velocity-gradient tensor, Q, a useful quantity to 
visualize the regions of high rotation that correspond to the coherent vortices. 
In the zero-pressure-gradient region near the inflow (top picture) many vortices 
can be observed, and they are roughly aligned with the flow direction, but form 
an angle to the wall. This picture is typical of zero-pressure-gradient boundary 
layers. In the accelerating region, on the other hand, fewer eddies are observed, 
and those present are more elongated and more closely aligned in the streamwise 
direction. This structure can be explained based on the fact that the mean 
velocity gradient has the effect of stretching and re-orienting the vortices. As they 
are stretched, their vorticity is increased by conservation of angular momentum, 
while their radius is decreased. The smaller, more intense eddies thus generated 
are more susceptible to dissipation by viscous effects. 
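
The quantity $Q$ used for these visualizations can be computed pointwise from the velocity-gradient tensor. For an incompressible flow the second invariant is $Q = -\frac{1}{2}\, (\partial u_i/\partial x_j)(\partial u_j/\partial x_i)$, which is positive where rotation dominates over strain. The sketch below evaluates it for two hypothetical gradient tensors; the function name and the examples are illustrative only.

```python
def q_criterion(grad_u):
    """Second invariant Q = -1/2 A_ij A_ji of a trace-free tensor A_ij = du_i/dx_j."""
    return -0.5 * sum(
        grad_u[i][j] * grad_u[j][i] for i in range(3) for j in range(3)
    )


if __name__ == "__main__":
    omega = 2.0
    # Solid-body rotation about the z axis: u = (-omega * y, omega * x, 0).
    rotation = [[0.0, -omega, 0.0], [omega, 0.0, 0.0], [0.0, 0.0, 0.0]]
    a = 3.0
    # Plane strain: u = (a * x, -a * y, 0).
    strain = [[a, 0.0, 0.0], [0.0, -a, 0.0], [0.0, 0.0, 0.0]]
    print("Q (rotation) =", q_criterion(rotation))
    print("Q (strain)   =", q_criterion(strain))
```

Iso-surfaces of a positive value of $Q$ therefore pick out the vortex cores while excluding regions of pure shear or strain, which is why this quantity is convenient for identifying the coherent vortices.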




Fig. 11. Instantaneous iso-surfaces of $Q(\delta^*/U_o)^2 = 0.02$ in the 
strong-acceleration case. Top: zero-pressure-gradient region. Bottom: acceleration 
region. 

This calculation highlights a significant advantage of LES over lower-level 
models. Whenever the coherent eddies play an important role in the flow evolu- 
tion, RANS calculations (in which the effect of all turbulent eddies is averaged 
out) cannot predict the flow development accurately. LES, on the other hand, 
has a better chance of following the dynamics of the coherent structures, as well 
as their response to the imposed perturbations. 

6 Applications: Flow in an Oscillating Channel 

6.1 Motivation 

Inherent unsteadiness of the driving conditions characterizes many turbulent 
flows, both natural (e.g., the gravity wave induced in ocean-bottom boundary 
layers, the blood flow in large arteries, the flow of air in lungs) and artificial 
(such as the flow in the intake of a combustion engine or the flow in certain 
heat exchangers). The characterization of unsteady boundary layers is crucial 
to many disciplines, such as the study of sediment transport in coastal waters, 
the biology of blood circulation, and so on; moreover, as was pointed out by 
Sarpkaya [32], by looking at features that are common to steady and unsteady 
boundary layers, we may better understand the underlying physics of turbulent 
flows altogether. As already recognized by Binder et al. [33], there are no special 
technical difficulties in performing DNS of pulsating flows. On the other hand, 
the same authors point out that the oscillating nature of the forcing is felt by 
the small scales too, so that before trusting the outcome of an LES based on 
standard closures, a careful (a posteriori) comparison with DNS has to be done. 
This is particularly true for eddy viscosity models, which rely on the combined 
assumptions that the SGS stress tensor $\tau_{ij}$ is aligned with the rate of strain and 
that the eddy viscosity is proportional to the magnitude of the stress. The latter 
postulate is somewhat relaxed for the dynamic Smagorinsky model of Germano 
et al. [4], since the eddy viscosity depends on the flux of energy towards the 
subgrid scales. 



Fig. 12. Sketch of the physical configuration. Oscillating channel flow. 
[Figure labels: driving pressure gradient $\Delta P/L = A + B\omega \sin \omega t$; channel length $L$.] 



To study the response of turbulence to an oscillating mean flow, a plane- 
channel flow driven by an oscillating pressure gradient was studied. The physical 
configuration is illustrated in Fig. 12: the flow between two flat plates that 
extend to ±oo in the streamwise (x) and spanwise (y) directions is simulated. 
To drive this periodic flow, a pressure gradient per unit length is introduced 
on the right-hand-side of the Navier-Stokes equations as a source term. In the 
case under investigation, this pressure gradient is given by 1 x 10“"^ +u’sinu’t, 
where uj is the angular frequency of the oscillation. This is the kind of flow 
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considered by Binder et al. [33]. The flow admits a laminar solution, which 
is a trivial extension of the Stokes problem. The flow first decelerates (as it is 
subjected to the adverse pressure gradient during the first half of the cycle), then 
accelerates again. During the acceleration phase, as observed before, the flow 
tends to relaminarize, whereas the adverse pressure gradient has the opposite 
effect, and makes the flow more turbulent. 

Since the core of the flow, where the velocity is large, is dominated by con- 
vective effects, while the regions near the solid boundary, where the velocity gra- 
dients are significant, are dominated by diffusive effects, there is a disparity in 
time-scales between these two regions: the diffusive time-scale being smaller than 
the convective one by orders of magnitude. Thus, as the frequency is changed, 
one would expect a significantly different coupling between the near-wall region 
(the inner layer) and the core of the flow (the outer layer). To study this coupling, 
calculations were carried out for a range of frequencies. 

Although the geometry is rather simple, and the grids used relatively coarse, 
this calculation still requires a large amount of CPU time. This is due to the long 
integration time necessary to achieve convergence. Since phase-averaged data is 
required, between eight and ten cycles of the oscillation are needed to obtain 
converged statistical samples. If the frequency is low, the equations of motion 
must therefore be integrated over a very long time. 
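
To illustrate why the cost grows at low frequency, the following minimal Python sketch phase-averages a time series over whole cycles and evaluates the physical time spanned by a given number of cycles. The function names and the uniform sampling (a fixed number of samples per cycle) are assumptions made for the example, not details of the code used here.

```python
import numpy as np


def phase_average(samples, steps_per_cycle):
    """Average samples taken at the same phase over all complete cycles.

    samples is a 1D time series sampled steps_per_cycle times per period;
    any incomplete trailing cycle is discarded.
    """
    samples = np.asarray(samples, dtype=float)
    n_cycles = samples.size // steps_per_cycle
    if n_cycles == 0:
        raise ValueError("need at least one complete cycle")
    whole = samples[: n_cycles * steps_per_cycle]
    return whole.reshape(n_cycles, steps_per_cycle).mean(axis=0)


def integration_time(n_cycles, omega):
    """Physical time spanned by n_cycles periods of angular frequency omega."""
    return n_cycles * 2.0 * np.pi / omega


# Ten cycles at two hypothetical frequencies: the time scales as 1/omega.
time_high_frequency = integration_time(10, omega=1.0)
time_low_frequency = integration_time(10, omega=0.1)
```

With the number of cycles fixed by the statistical sample, the integration time is inversely proportional to the frequency.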



6.2 Numerical Method 



The starting point for this calculation is a well-known serial spectral code for 
the solution of the filtered Navier-Stokes equation in a channel geometry [34,35]. 
Fourier expansions are used in the homogeneous (horizontal) directions, while 
Chebychev collocation is used in the vertical direction. The code is very highly 
optimized for a vector architecture. Time-advancement is performed using the 
fractional time-step method described above; however, the implicit Crank-Nicol- 
son method is used for the vertical diffusion and a low-storage third-order Runge- 
Kutta scheme is employed for the remaining terms. The procedure described in 
Section 5.1 still applies, with a few obvious modifications. Each sub-step of the 
Runge-Kutta scheme follows the sequence: 



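1. Compute the nonlinear terms $H_i^n$ and the horizontal diffusive terms 
with information at time $t^n$. Both these terms are computed in real space. 

2. Transform the right-hand side (nonlinear plus horizontal diffusive terms) 
into Fourier space; its transform is denoted $\hat{H}_i^n$. 

3. Update the predicted solution $\hat{v}_i$ in Fourier space: 

$$\left(1 - \frac{\nu \Delta t}{2} \frac{\partial^2}{\partial x_3^2}\right) \hat{v}_i = \left(1 + \frac{\nu \Delta t}{2} \frac{\partial^2}{\partial x_3^2}\right) \hat{u}_i^n + \Delta t \, \hat{H}_i^n \qquad (22)$$

by solving implicitly the vertical diffusive problem. Since a Chebychev 
collocation method is used in the vertical direction z, a full matrix is obtained 
for each mode, which is inverted iteratively by a Generalized Minimum Residual 
method. 

4. Solve the Poisson problem for the pressure in Fourier space: 

$$\left(\frac{\partial^2}{\partial x_3^2} - k^2\right) \hat{p} = \frac{1}{\Delta t} \left[ i \left(k_1 \hat{v}_1 + k_2 \hat{v}_2\right) + \frac{\partial \hat{v}_3}{\partial x_3} \right] \qquad (23)$$

where $k^2 = k_1^2 + k_2^2$, again using a Generalized Minimum Residual method 
to solve the system of linear equations. 

5. Update the solution: 

$$\hat{u}_1^{n+1} = \hat{v}_1 - \Delta t \, i k_1 \hat{p}; \quad \hat{u}_2^{n+1} = \hat{v}_2 - \Delta t \, i k_2 \hat{p}; \quad \hat{u}_3^{n+1} = \hat{v}_3 - \Delta t \, \frac{\partial \hat{p}}{\partial x_3}. \qquad (24)$$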
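
Steps 4 and 5 amount to a pressure projection that removes the divergence of the predicted velocity. The Python sketch below shows the idea under simplifying assumptions that differ from the code described here: the domain is taken as triply periodic, so Fourier expansions replace the Chebychev collocation in the vertical, and the Poisson problem is solved by direct division for each mode instead of by an iterative method. The function names and the scaling of the pressure by the time step are likewise assumptions of the example.

```python
import numpy as np


def integer_wavenumbers(n):
    """Wavenumber vectors (k1, k2, k3) on an n^3 periodic grid, shape (3, n, n, n)."""
    k = np.fft.fftfreq(n, d=1.0 / n)  # 0, 1, ..., -2, -1
    return np.stack(np.meshgrid(k, k, k, indexing="ij"))


def pressure_projection(v_hat, wavenumbers, dt):
    """Project the predicted velocity v_hat onto divergence-free fields.

    v_hat       : complex Fourier coefficients, shape (3, n1, n2, n3)
    wavenumbers : (k1, k2, k3) for every mode, same shape
    dt          : time step used to scale the pressure
    """
    k_sq = np.sum(wavenumbers ** 2, axis=0)
    # Divergence of the predicted velocity in Fourier space: i k . v_hat
    div_hat = 1j * np.sum(wavenumbers * v_hat, axis=0)
    # Poisson problem -k^2 p_hat = div_hat / dt; the mean mode (k = 0) is left at zero
    safe_k_sq = np.where(k_sq == 0.0, 1.0, k_sq)
    p_hat = np.where(k_sq == 0.0, 0.0, -div_hat / (dt * safe_k_sq))
    # Correction u_hat = v_hat - dt * grad(p), with grad(p) -> i k p_hat
    u_hat = v_hat - dt * 1j * wavenumbers * p_hat
    return u_hat, p_hat


# Usage on a random predicted field (hypothetical data).
rng = np.random.default_rng(0)
k_vec = integer_wavenumbers(8)
v_hat_example = np.fft.fftn(rng.standard_normal((3, 8, 8, 8)), axes=(1, 2, 3))
u_hat_example, p_hat_example = pressure_projection(v_hat_example, k_vec, dt=0.01)
residual_divergence = np.max(np.abs(1j * np.sum(k_vec * u_hat_example, axis=0)))
```

Taking the divergence of the corrected field cancels the divergence of the predicted one mode by mode, which is why the Poisson right-hand side and the velocity correction must carry matching factors of the time step.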



An initial series of numerical experiments was performed on an SGI Origin 
2000 with 32 R10000 processors running at 195 MHz, each equipped with 640 
MB of RAM and 4 MB of cache, owned by the University of North Carolina. Each 
processor is rated at 390 MFLOPS. Four discretizations were chosen: 32x32x32, 
64x64x64, 128x128x96 and 128x192x128. The serial code experiences a drop 
in performance as the domain grows, from 60 MFLOPS to about 19. This is 
due to the limited cache, which acts as a bottleneck. The problem is made more 
acute by the fact that the discrete Fourier transform, which is the heart of the 
code, is a nonlocal operation. Frigo and Johnson [36] performed extensive testing 
of different FFT routines on cache-based machines, and, without exception, all 
routines showed a marked slowdown when a certain critical size (both machine- 
and routine-dependent) was reached (see, for instance, Fig. 4 of their paper). 
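
A rough estimate shows how quickly the grids outgrow the cache. The Python sketch below compares the storage of a single scalar field on each of the four grids with the 4 MB cache quoted above. It assumes 8-byte (double precision) values and counts only one field; both are assumptions of the example, and an actual solver holds several fields plus work arrays, so the real working set is larger.

```python
CACHE_BYTES = 4 * 1024 ** 2  # 4 MB cache per processor, as quoted in the text
BYTES_PER_VALUE = 8          # assumption: double-precision storage

GRIDS = [(32, 32, 32), (64, 64, 64), (128, 128, 96), (128, 192, 128)]


def field_bytes(grid, bytes_per_value=BYTES_PER_VALUE):
    """Storage needed for one scalar field on an nx * ny * nz grid."""
    nx, ny, nz = grid
    return nx * ny * nz * bytes_per_value


def fits_in_cache(grid, cache_bytes=CACHE_BYTES):
    """True if a single scalar field fits entirely in the cache."""
    return field_bytes(grid) <= cache_bytes


# Size of one field in units of the cache size, for each grid.
cache_ratios = {grid: field_bytes(grid) / CACHE_BYTES for grid in GRIDS}
```

Under these assumptions only the two smallest grids keep even one field in cache; the two largest exceed it by factors of 3 and 6 respectively.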



6.3 The Parallel Code 

The current trend in supercomputer technology is towards achieving raw com- 
putational power by assembling a large number of relatively inexpensive nodes, 
based on mass-produced RISC CPUs connected by high-speed data paths. Examples 
are the Origin 2000 by SGI (R10000), the IBM SP/6000 (Power PC) 
and the Cray T3D (ALPHA). While it is appealing to be able to obtain large 
theoretical computational speeds at a fraction of the cost of traditional vector-based 
machines, this paradigmatic shift requires a re-examination of the existing 
codes. A case in point is the spectral Navier-Stokes solver discussed in the previous 
section. To parallelize it, we begin by noticing that the computationally 
intensive steps of solving the Helmholtz and Poisson problems amount to solving 
$i_{max} \times j_{max}$ 1D problems, where $i_{max}$ ($j_{max}$) is the number of collocation points 
in the streamwise (spanwise) direction. 

The load can then be distributed among p processors. A Single Program Multiple 
Data (SPMD) approach was adopted. Each process executes essentially 
the same operations on different portions of the domain, which are private to 
them. Message passing is used to exchange information between processes, using 
Message Passing Interface (MPI) library calls. 
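
Because the 1D problems are independent, distributing them is a matter of giving each process a contiguous range of problem indices. The following Python sketch shows one common block distribution; the function name and the rule for handing out a remainder are illustrative assumptions, since the text does not state how uneven counts are treated.

```python
def block_range(n_items, n_procs, rank):
    """Half-open index range [start, stop) of the items owned by process rank.

    Items are split into contiguous blocks whose sizes differ by at most one;
    the first (n_items % n_procs) processes receive one extra item.
    """
    base, extra = divmod(n_items, n_procs)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


# 128 x 128 independent 1D problems shared by 8 processes (hypothetical case).
n_problems = 128 * 128
ranges = [block_range(n_problems, 8, rank) for rank in range(8)]
```

In an SPMD program every process would call the same function with its own rank and then loop only over its own range.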

The domain is sliced along either the z or the x direction, and each proces- 
sor owns a slice. During the computation of the nonlinear term, the domain is 
sliced along the z direction (see Fig. 13). When vertical derivatives are needed, 
a transposition is performed, in which process j sends sub-block i of the domain 
to process i, and receives in turn the sub-block j from process i. MPI 
implements this kind of alltoall scatter/gather operation transparently. After 
the nonlinear term is calculated, the domain is swapped so that each process 
owns vertical slices, and the Helmholtz and Poisson problems are solved with- 
out further swapping. At the end, the solution is swapped back into horizontal 
slices and the cycle begins again. Incidentally, this approach predates the use of 
parallel computers, being used for DNS on Cray X-MP to feed the fields into 
core memory one slice at a time (see, for example, [37]). 




Fig. 13. The domain is split among processes (CPUs) either along the z (left) 
or the x (right) coordinate. An alltoall scatter/gather is used to go from one 
configuration to the other. 
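
The data movement of this transposition can be sketched without any message passing by keeping the slabs of all processes in a Python list. This is only a serial emulation for illustration: in the MPI code the exchange is performed by the library's all-to-all collective. The array layout (x, y, z), the function name and the requirement that the grid divide evenly among the processes are assumptions of the example.

```python
import numpy as np


def alltoall_transpose(z_slabs):
    """Emulate the all-to-all exchange that turns z-slabs into x-slabs.

    z_slabs[j] is the slab owned by process j, of shape (nx, ny, nz / p).
    Returns x_slabs, where x_slabs[i] has shape (nx / p, ny, nz).
    """
    p = len(z_slabs)
    # send[j][i] is sub-block i of process j's slab, cut along x
    send = [np.split(slab, p, axis=0) for slab in z_slabs]
    # process i receives sub-block i from every process j and stacks them along z
    return [np.concatenate([send[j][i] for j in range(p)], axis=2) for i in range(p)]


# Hypothetical 8 x 4 x 8 field shared by 4 processes.
field = np.arange(8 * 4 * 8, dtype=float).reshape(8, 4, 8)
z_slabs = np.split(field, 4, axis=2)  # each process owns 2 horizontal planes
x_slabs = alltoall_transpose(z_slabs)  # each process now owns all of z
```

After the exchange every process holds complete vertical lines for its share of x, so the vertical problems can be solved without further communication.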



6.4 Speedup and Scalability 

The performance of a parallel program is measured by the speedup factor S, 
defined as the ratio between the execution time $T_\sigma$ of the serial program and the 
execution time $T_\pi$ of the parallel version (see Pacheco [38]). In our approach, 
the load is evenly balanced between processes (with the negligible exception of 
I/O, which is handled by one process), so that an equivalent measure is the 
efficiency E, defined as the ratio between $T_\sigma$ and the total time $p \, T_\pi$ consumed by 
the p processes. In general, for a given machine, E = E(n,p), where n is the 
size of the problem being solved. In Table 1 we show the efficiency for different 
values of n and p, with the corresponding MFLOPS rate in parentheses. 

The striking result is that it is possible to achieve a super-linear speedup. 
This is made possible by the fact that the smaller parallel processes use the cache 
more efficiently than the serial code. For instance, for the grid 128x128x96, 
the serial code reuses on average an L2 cache line 4.6 times before discarding it; 
using 4 processes the value per process increases to 7.3, while with 8 it becomes as 
high as 21.4. The gain is of course offset by the overhead generated by message 
passing. However, due to the efficient implementation of MPI on the Origin 2000, 
we found that the time spent swapping data between processes represents less 
than 10% of the total time, in the worst case. 

Size          p = 2       p = 4       p = 8       p = 16
32x32x32      .8 (84)
64x64x64      .93 (81)    .81 (120)
128x128x96                1.1 (88)    1.3 (168)
128x192x128                           1.1 (168)   .91 (276)

Table 1. Efficiency of the parallel spectral Navier-Stokes solver and (in parentheses) 
achieved MFLOPS rate. 
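
Since the efficiency is the speedup divided by the number of processes, the entries of Table 1 can be converted back into speedups. The Python sketch below does this arithmetic; the efficiencies are the values read from Table 1, while the dictionary layout and the function names are assumptions of the example.

```python
def speedup(t_serial, t_parallel):
    """Ratio of serial to parallel execution time."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, n_procs):
    """Serial time divided by the total time consumed by all processes."""
    return t_serial / (n_procs * t_parallel)


def speedup_from_efficiency(eff, n_procs):
    """Invert E = S / p to recover the speedup."""
    return eff * n_procs


# Efficiencies as read from Table 1, keyed by (grid, number of processes).
TABLE_1_EFFICIENCY = {
    ("32x32x32", 2): 0.80,
    ("64x64x64", 2): 0.93,
    ("64x64x64", 4): 0.81,
    ("128x128x96", 4): 1.1,
    ("128x128x96", 8): 1.3,
    ("128x192x128", 8): 1.1,
    ("128x192x128", 16): 0.91,
}

speedups = {key: speedup_from_efficiency(e, key[1]) for key, e in TABLE_1_EFFICIENCY.items()}
# Cases with E > 1, i.e. a speedup larger than the number of processes.
superlinear = sorted(key for key, e in TABLE_1_EFFICIENCY.items() if e > 1.0)
```

An efficiency of 1.3 on 8 processes, for example, corresponds to a speedup of about 10.4, larger than the number of processes.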

6.5 Results 

The Reynolds number based on channel height and the time-averaged centerline 
velocity was 7500 for all calculations. Simulations were carried out for several 
values of the frequency of the driving pressure gradient, resulting in a Reynolds 
number, based on the thickness of the laminar oscillating layer, $\delta = (2\nu/\omega)^{1/2}$, 
and the oscillating component of the velocity, ranging between $Re_\delta = 100$ and 
1000. The low $Re_\delta$ case was simulated using both a DNS on a 128 x 128 x 96 grid, 
and an LES using the dynamic eddy-viscosity model [4] on the same domain, 
discretized using a 32 x 32 x 49 grid. All the other cases were simulated only 
using the LES approach. 
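
The link between the forcing frequency and this Reynolds number can be sketched in a few lines of Python. The example assumes the standard thickness of the laminar oscillating (Stokes) layer, $(2\nu/\omega)^{1/2}$, and a Reynolds number formed with the amplitude of the oscillating velocity; the function names and the numerical values are hypothetical and are not the parameters of these simulations.

```python
import math


def stokes_layer_thickness(nu, omega):
    """Thickness sqrt(2 nu / omega) of the laminar oscillating layer."""
    return math.sqrt(2.0 * nu / omega)


def reynolds_delta(u_osc, nu, omega):
    """Reynolds number based on the oscillating velocity and the layer thickness."""
    return u_osc * stokes_layer_thickness(nu, omega) / nu


def omega_for_reynolds_delta(re_delta, u_osc, nu):
    """Frequency giving a prescribed Reynolds number, at fixed u_osc and nu."""
    return 2.0 * u_osc ** 2 / (nu * re_delta ** 2)


# Hypothetical values: the frequency ratio between the two ends of the range.
omega_low_re = omega_for_reynolds_delta(100.0, u_osc=1.0, nu=1.0e-3)
omega_high_re = omega_for_reynolds_delta(1000.0, u_osc=1.0, nu=1.0e-3)
frequency_ratio = omega_low_re / omega_high_re
```

At fixed velocity amplitude and viscosity the Reynolds number varies as the inverse square root of the frequency, so a tenfold range in Reynolds number corresponds to a hundredfold range in frequency, with the lowest Reynolds number at the highest frequency.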

Figure 14 shows the centerline velocity (normalized by the friction velocity $u_\tau = (\tau_w/\rho)^{1/2}$, 
where $\rho$ is the fluid density and $\tau_w$ is the shear stress at the wall) and $\tau_w$ itself. 
The abscissa is the normalized phase, $\phi = \omega t$. While the centerline velocity is in 
phase with the imposed pressure gradient, the wall stress is not. At high frequencies 
a sinusoidal shape is preserved, whereas for low frequencies the distribution 
of $\tau_w$ becomes very asymmetric. This is due to quasi-relaminarization of the flow 
during the acceleration phase, which is followed by a dramatic instability in the 
deceleration one. Good agreement with the DNS can be observed. 

Figure 15 shows the mean velocity profiles at several phases. Good agreement 
is again observed between the LES and the DNS for the $Re_\delta = 100$ case. At 
this frequency a region of reversed flow is present, since the thickness of the 
oscillating Stokes layer reaches into the buffer layer and the flow reverses near 
the wall during the decelerating phase (without detachment of the boundary 
layer). For lower frequencies such reversal is not observed. 

Different behaviors of the near-wall region as the frequency is decreased are 
evident in Fig. 15. A more dramatic illustration of the same phenomena can be 
seen in Fig. 16, in which contours of the turbulent kinetic energy are shown. 
At the highest frequency the inner and outer layers appear largely decoupled. A 
thickening of the inner layer can be observed at the end of the deceleration phase 
($\phi/2\pi \approx 0.5$), which, however, does not propagate far into the outer layer: by 
$z/H \approx 0.2$ the contours are nearly undisturbed. At lower frequencies, however, 
the inner layer has the time to adapt to the perturbation introduced by the 
pressure pulse; at the lowest frequencies in particular the flow can be observed 
to relaminarize, as indicated by the absence of turbulent kinetic energy. A shift 
of the more quiescent region of the flow from $\phi/2\pi \approx 0.8$ towards $\phi/2\pi \approx 0.5$ 
can also be observed, which can likewise be explained by the increased time 
that the inner layer has to adapt to the outer-flow perturbation. 




Fig. 16. Contours of the turbulent kinetic energy (normalized by the mean wall 
stress) in the oscillating channel. 26 equi-spaced contours between 0 and 12.5 
are shown. 



The turbulent eddy viscosity (Fig. 17) adjusts to the unsteady perturbation. 
It is not in phase with the local shear and vanishes as the flow relaminarizes 
during the earlier portion of the accelerating phase. This is in agreement with 
results from the DNS concerning the evolution of the turbulent kinetic energy 
production term. 

Fig. 17. (a) Phase-averaged eddy viscosity (normalized by the molecular viscosity) 
at $z/H = 0.0265$. (b) Phase-averaged $d\overline{U}/dz$ at $z^+ = 15$. (c) Phase-averaged 
mid-channel velocity. $Re_\delta = 100$. 



7 Conclusions 

Large-eddy simulations have shown the ability to give accurate predictions of the 
turbulent flow in configurations in which the flow is not in equilibrium, albeit 
in fairly simple geometric configurations. This type of calculation can now be 
routinely carried out on desktop workstations, with reasonable throughput times. 
Parallel computers are required in more complex geometries, in flows in which 
large computational domains are necessary, and in cases in which long averaging 
times are required to obtain converged statistics. 

The next stage in the development of this technique will involve the use of 
LES in more complex geometries. Challenges that need to be met to achieve 
this goal include the development of energy-conserving, high-order schemes in 
generalized coordinates or on unstructured meshes, and of accurate wall mod- 
els to simulate the near-wall region without resolving in detail the inner-layer 
eddies. Combustion models, and SGS models for compressible applications are 
other areas in which additional research is required to exploit fully the potential 
of LES. Applications in complex geometries, especially those including combus- 
tion, multi-phase flows, or mean-flow unsteadiness, are not likely to be feasible 
on desktop workstations. Memory-intensive problems will also require parallel 
machines. 

Researchers who use large-eddy simulations are typically end-users of the al- 
gorithmic improvements developed by mathematicians and computer scientists. 
A close collaboration between workers in these fields is, therefore, desirable in 
order to achieve some progress in the challenging area of turbulence prediction 
and control. 